In input/output (I/O) bound computer systems which use disks for backup
memory, rotational latency is frequently the source of the I/O boundness.
This paper presents several methods for eliminating rotational latency in
head-per-track disks or equivalent memory devices. These methods assume
pre-run-time knowledge of the I/O sequences required by a particular program
which will be run on the system. They work by constantly reorganizing the data
on the disk during execution of the program. Measures are obtained of
certain system parameters which indicate the requirements for use of these
techniques, so that the various tradeoffs may be examined and quantified. These
results are useful both for designing a new operating system and for removing
latency from an already existing operating system.

1. INTRODUCTION

1.1  The  Problem 

This thesis is concerned with the elimination of rotational latency
in several classes of I/O bound programs. Within the context of this paper,
I/O boundness is that situation in which a processor requires data faster
than the system's backup memory can supply them using a simple on-demand
strategy. This situation, therefore, depends on the memory, the processor
and the program, and is a property of all three. Thus, a program may be I/O
bound for some particular system (memories and processor) but not for some
other. The term I/O bound program will implicitly refer to both a program
and some particular system.

The literature contains a succession of drum latency minimization
techniques, starting with the IBM-650 and IBM-650-like processors [6,7,9].
In these machines, all memory references, including instruction sequencing,
were potential sources of rotational latency. The more recent literature
deals more with the statistical nature of requests for data from a drum and/or
the retrieval of data (records, pages, etc.) already stored there in some
non-optimal manner [1,4,5].

In  this  paper  we  deal  with  a  predetermined  sequence  of  demands  to 
the  drum  and  assume  that  the  initial  configuration  of  these  data  on  the  drum 
is  established  as  an  optimal  one.   Subsequent  positioning  of  data  on  the  drum 
(for  example  during  output)  is  such  that  the  locations  continue  to  be  optimal 
with respect to future retrieval from the drum. This approach was taken for
a limited class of problems in [3].

The solution of an I/O bound program will consist of an algorithm
which provides a particular sequence of I/O transactions between memories
such that rotational latency is eliminated. Included in such a solution is
a program and processor independent upper bound on the size of buffer memory
required. Thus the results presented here are useful either in programming
for an existing system or as system design parameters. The data of such an
I/O bound program are assumed to be stored on a sufficiently large periodically
addressable device such as a head-per-track disk, a drum, a recycling delay
line or shift register, or some other device whose only latency is rotational
in nature. This excludes disks with head positioning latency, but some of
the results obtained here may be extended to include these devices.

All of the zero latency schemes given in this paper require
disks which are capable of both reading and writing at the same time. That
is, at any point in time while the program is being executed, data may be
read from the disk, written to the disk, both, or neither.

While this requirement is somewhat artificial, it permits the
analysis of the data transmissions in the methods given here to remain more
comprehensible. It is also the case that several disk manufacturers have
indicated that such disks may be produced for no more than a 10-20% additional
cost over their single I/O counterparts. Furthermore, Chapter 5 gives a
method by which standard disks without the concurrent read/write ability may
simulate the other kind entirely through software techniques.

In  a  uniprogramming  situation,  which  this  paper  is  concerned  with, 
I/O  bound  programs  cost,  in  time,  an  average  of  half  a  disk  rotation  plus 
transmission  time  for  each  access.   If  the  program  is  long  running,  as  in 
the  case  of  one  which  iterates  many  times  over  most  or  all  of  its  data  base, 
the  total  latency  becomes  quite  large.   Such  a  situation  may  easily  give  rise 
to  more  latency  time  than  actual  process  time. 
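
As a rough illustration of this cost, the following sketch totals the
expected I/O time of a program that fetches its blocks on demand. The
function name and the sample figures are assumptions chosen for
illustration; they are not measurements from any particular system.

```python
def expected_io_time(accesses, disk_period, transmission_time):
    """Expected time spent on I/O under a simple on-demand strategy.

    Each access waits, on average, half a disk rotation and then
    spends transmission_time moving the block.
    """
    average_latency = disk_period / 2.0
    return accesses * (average_latency + transmission_time)


# Hypothetical figures: 10,000 block accesses, a 20 ms rotation,
# and 2 ms to transmit one block.
total = expected_io_time(10_000, 0.020, 0.002)
latency_only = 10_000 * 0.020 / 2.0
print(f"total I/O time: {total:.1f} s, of which latency: {latency_only:.1f} s")
```

With these assumed figures the latency term is five times the
transmission term, which is the kind of imbalance the methods of this
paper are meant to remove.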


For many programs, however, it is clearly the case that with some
forethought and planning the data could actually be placed at the proper
locations on the disk to eliminate this latency for at least the first pass
through the data. That is, for the first time the data are accessed from the
disk, they may be placed in those locations on the disk from which they may
be read at precisely the proper moment to be transferred to the processor
when they are needed by the processor. Indeed, a fairly simple algorithm
will accomplish this effect with less precise requirements: data are
pre-fetched from the disk in the order in which they will be required and held in
a buffer until required. This will yield a zero latency data accessing
sequence for the first pass through the data, although it is not obvious what
the size of buffer must be to accomplish this. For subsequent passes, however,
it seems that somewhat more sophisticated algorithms may be used to
accomplish the same effect. These would ensure that all necessary data can
be pre-fetched in the allotted time and further ensure that there is a fixed
and reasonable bound on the size of the buffer which is required for the
procedure.

This paper exploits the above premise. We show that for some
classes of programs there exists an initial disk layout of the data base
which provides a zero latency initial pass of the program through its data.
We further provide algorithms which reformat the disk (in real time) to
provide zero latency data accesses from the disk for subsequent passes over the
data base. This, in effect, accomplishes a continuous dynamic allocation of
the disk.

One requirement must be met before any methods presented here may
be applied. The processor, the program to be executed by it and the disk which
will be used must be well matched. This simply means that the maximum data
rate requirements of the program (executed on the particular processor) must
be satisfiable by the disk. In other words, the program/processor should
not crunch numbers faster than the disk's ability to provide them. Unless
this criterion is met, there can be no hope of a zero latency solution of
the type derived here.

A program consists of one or more computational processes which
operate on the data base of that program. A computational process will
require input data in a predictable sequence. It is assumed that the execution
time of each computational process is known to within a relatively small
tolerance. This last assumption permits the prediction of data requirement
timing as well as sequencing. The solutions which are outlined here all have
the property that the disk and processor remain synchronized with each other
(cf. Section 5.6).

The extraction of computational processes and their relevant
parameters is expected to take place at compile time, although some of the parameters
may be obtained at run time. In any event, it is this "early" information
which will permit most of the latency eliminating methods presented here to
work.

This information may be obtained, in some cases, by a fairly
straightforward scanner to determine data sequences. For complicated programs,
however, an analyzer of the type described in [10] may be suitably modified
to yield this information.

The data are assumed to be lumped into blocks. Block size is
chosen on the basis of several considerations. A block size smaller than the
smallest addressable segment on the disk wastes disk capacity.
Furthermore, since many algorithms require many variables in primary memory
at one time (as in the case of partial differential equation-type mesh
problems where neighborhoods of variables are computed upon at the same time),
very small blocks are uneconomical. This is because certain variables known
as edge values are stored redundantly. The number of such edge values is
directly proportional to the number of boundaries between blocks. Thus a
partitioning into many small blocks will produce many redundantly stored
values. On the other hand, there will typically be only a small number of
blocks in memory at one time (interaction between data is assumed to be
either within blocks or between only a few blocks over any time interval).
And because the size (and hence cost) of primary memory scales linearly with
block size, this militates against very large blocks. Thus the block size
is determined as a function of the primary memory (processor's memory), the
disk's characteristics and the program which will be running. Chapter 5
contains a discussion of block size determination.
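
The dependence of redundant storage on the number of block boundaries can
be made concrete with a small sketch. The layout assumed below (a
rectangular mesh cut into horizontal strips, with a fixed number of rows
duplicated on each side of every interior boundary) is a hypothetical
simplification introduced here for illustration, as are the function and
parameter names; it is not the storage scheme of any later chapter.

```python
def redundant_edge_values(mesh_width, num_strips, depth):
    """Count redundantly stored edge values for a strip partitioning.

    Assumes (for illustration only) a rectangular mesh cut into
    horizontal strips, where each interior boundary requires `depth`
    rows from each neighboring strip to be stored a second time.
    """
    if num_strips < 1:
        raise ValueError("num_strips must be at least 1")
    boundaries = num_strips - 1
    return 2 * depth * mesh_width * boundaries


# Hypothetical mesh with 1000 points per row and one duplicated row
# on each side of a boundary.
for strips in (2, 10, 100):
    print(strips, "strips:", redundant_edge_values(1000, strips, 1), "redundant values")
```

Under these assumptions the redundant storage grows linearly with the
number of boundaries, which is why very small blocks are uneconomical.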

The reader may wish to substitute the word "page" for "block"
wherever it appears here. The common use of "page" is consistent with
"block" here except that the preceding remarks about size determination
should be heeded.

This  paper  will  give  algorithms  which  produce  zero  latency 
solutions  for  particularly  well-defined  classes  of  programs.   Generalizations 
are  also  given  which  permit  application  of  these  same  methods  to  somewhat 
different,  less  restrictive  types  of  programs.   These  generalizations,  while 
less  formal  than  the  original  formulations,  will  hopefully  suggest  additional 
application  of  the  basic  methods  to  latency  elimination  in  programs. 


1.2  Terminology  and  Notation 

A  disk  is  any  large  capacity  rotating  storage  device  whose  latency  is 
solely  rotational.   Thus,  drums  and  recycling  shift  registers  are  disks. 

Latency is rotational latency, or the time spent "waiting" for a disk
to rotate so that the head is correctly positioned to begin reading the next
needed data.

A physical track of a disk is a recording track on a disk which is
accessed by a single read and/or write head. It may be thought of as
beginning at some arbitrary angular origin and extending for a full rotation of
the disk.

A  logical  track  of  a  disk  is  a  simulated  physical  track  of  arbitrary 
length.   That  is,  it  is  a  recording  track  which  may  extend  for  more  than  a 
single  disk  rotation.   The  simulation  is  accomplished  by  switching  read  or 
write  heads  from  one  physical  track  to  another  at  successive  occurrences  of 
the  origin  at  the  heads.   Intuitively,  a  logical  track  is  several  physical 
tracks  laid  out  end-to-end. 
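
The head-switching rule amounts to a simple address calculation, sketched
below. The function name and the choice of words as the unit of position
are assumptions made for illustration.

```python
def locate_on_logical_track(offset, words_per_rotation):
    """Map an offset along a logical track to (physical track, angular offset).

    offset is measured in words from the start of the logical track.
    The heads are switched to the next physical track each time the
    origin passes, so the physical track index is the number of full
    rotations that precede the offset.
    """
    if offset < 0 or words_per_rotation <= 0:
        raise ValueError("offset must be >= 0 and words_per_rotation > 0")
    track, angular_offset = divmod(offset, words_per_rotation)
    return track, angular_offset


# Hypothetical disk holding 1000 words per rotation.
print(locate_on_logical_track(2500, 1000))
```

Here word 2500 of the logical track falls on the third physical track
(index 2), halfway around from the origin.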

Density  is  a  measure  of  the  amount  of  data  stored  on  some  portion  of 
a  disk.   It  is  usually  expressed  in  bits  per  physical  track  or  words  per 
physical  track  although  these  units  are  more  commonly  referred  to  as  bits  per 
rotation  or  words  per  rotation,  respectively.   The  maximum  density  for  a  disk 
is  simply  that  quantity  of  bits  or  words  per  track  which  is  the  largest  that 
may  be  recorded  on  a  single  physical  track  of  that  particular  disk,  due  to 
its  physical  properties. 

$Q_{\max}$ is that amount of data (in either words or bits) which may occupy
a physical track on a disk at its maximum density.


$Q$ is that amount of data which occupies a physical track on a disk.
If the data under consideration happen to be stored at
maximum density, $Q = Q_{\max}$. This quantity of data is frequently referred to
as "a rotation of data."

$T_d$ is the disk period. Time is frequently expressed in terms of this
quantity. That is, we frequently measure time in disk rotations. A disk
rotation is $T_d$ seconds.

The data base is the collection of all data associated with a program.
It includes all initial data as well as intermediate results and final
results.

A  data  entity  is  some  distinguishable  portion  of  a  program's  data  base. 
Data  entities  are  mutually  disjoint.  An  array  is  an  example  of  a  data  entity. 

A program during its execution will require input data. In the
schema here, this is in the form of input blocks. We say that computational
processes compute on their input data. Results of computational processes
are output data. This is handled in the form of output blocks.

A computational process is a distinct computational portion of a
program. It is assumed to be devoid of I/O statements and all conditional
branches except those which provide for looping or are otherwise predictable.
The input and output sequences for these processes are generated by the
algorithms presented in this paper.

A computational process which only does updating of values in a data
entity will have both its input blocks and output blocks contained in that
same data entity. A process which performs, say, a binary operation on two
matrices to yield a third one will have its input blocks coming from two
entities. Its output blocks will be contained in yet a third. A particular
computational process may or may not operate over its program's entire data
base but it will usually operate over an entire data entity if it operates
on any part of it. This is because a computational process is usually
assumed to operate iteratively over an entire entity. We frequently refer
to a computational process as simply a process where no ambiguity will result.

Primary memory is the fastest memory available to the processor (save
perhaps a limited number of high speed registers). It is random access and
contains the current data being computed upon (with the exception of perhaps
a small amount in registers). Because processes will access their input data
from primary memory, it is frequently referred to as processor memory.

Primary memory need contain only "a few" blocks of data at any point
in time; i.e., primary memory may be as small as 3 blocks of data plus any
space needed to store the program itself. Thus this memory is relatively small.
A discussion of block size appears in Chapter 5.

Buffer  memory  is  a  typically  large,  slow  memory  with  respect  to 
primary  memory.   It  is  also  random  access  memory  and  is  used  to  match  data 
rates  between  the  disk  and  the  primary  memory  and  also  as  a  data  reservoir 
in  which  data  are  reformatted.   Buffer  memory  is  frequently  considered  to  be 
logically  partitioned  into  two  areas.   The  input  buffer  contains  data  on  its 
path  from  the  disk  to  the  primary  memory  whereas  the  output  buffer  contains 
data  on  its  path  from  the  primary  memory  to  the  disk.   Output  data  will  pass 
through  the  output  buffer  to  the  disk.  When  they  are  next  retrieved,  they  are 
input  data  and  pass  through  the  input  buffer  to  the  primary  memory. 

A system is a processor, a primary memory, a buffer memory and a disk
in the following configuration:

[Figure: system configuration of processor, primary memory, buffer memory and disk.]

The relevant parameters are the disk period, $T_d$, the maximum data rate from
the disk, $D$, the size and bandwidth of the buffer memory, B, and the data
word consumption rate of the processor. The primary memory is assumed to be
matched to the processor. The bandwidth of the buffer memory must be at
least as large as that of the disk.

2. MESH PROBLEMS

2.1  Introduction 

This chapter is concerned with a particular class of programs and
the elimination of rotational latency from their execution. Several of the
results presented here will be superseded in later chapters but are nevertheless
interesting.

The problem class described here is fairly common, representing a
considerable fraction of those which are frequently I/O bound. Solutions
applicable to this particular class of problems are thus of interest.
Furthermore, an understanding of the solutions developed in this chapter will enable
the reader to better generalize these results to those of later chapters.

The types of programs to be analyzed in this chapter include those
with data structures which are derived from partial differential equation
(PDE) problems. These data structures are characterized by a large partitioned
array or mesh which consists of one or more variables at each mesh point.
Except for small areas of the mesh close to its boundaries, the mesh is assumed
to be homogeneous and continuous, and computations on the variable(s) at mesh
points or nodes will use the values of neighboring variables. Thus, one
significant property of this class of programs is the requirement of data from
partitions other than the one being directly computed upon.

The problem to be solved here is the minimization of latency for
programs whose data cannot be contained in primary memory. An example of the
class of programs being considered is the set of programs which obtain the solution
of large PDE problems by finite difference methods. The following discussion,
then, will be restricted to those programs, but it should be remembered that
the methods presented here can be used directly for other computations with
the same data structures.

A program for which latency is to be eliminated is partitioned into
one or more computational processes, or processes, each of which computes on
the data base. The processes are chosen for the predictability of their
running times as well as their relatively "long" running times. By long
running times we mean that the time for a computational process to compute
on an input block should be no less than the transmission time between the
input buffer and primary memory for an input block. It is assumed that each
process will be applied iteratively to the entire data structure.

This  problem  will  be  explicitly  solved  for  those  programs  consisting 
of  a  single  process.   Solutions  will  then  be  given  for  generalizations  to 
multi-process  programs.  To  simplify  the  following  discussion,  all  transmission 
times  between  memories  are  assumed  to  be  zero. 

2.2  Single  Process  Programs 

The  discussion  in  this  section  is  limited  to  those  programs  containing 
a  single  process.   This  set  of  operations  occurs  iteratively  over  the  data 
points  of  the  mesh  and,  in  general,  in  iterative  sweeps  over  the  mesh.   That 
is,  if  a  sweep  is  an  application  of  the  process  to  all  of  the  data,  real 
programs  will  contain  several  sweeps  over  the  data. 

These  data  are  usually  altered  during  each  sweep  although  the 
collection  of  all  relevant  variables  will  be  imprecisely  referred  to  as  "the 
mesh"  or  "the  data." 

One feature that makes finite difference equation data structures
difficult to handle is the interaction between mesh data points. While the
overall array may be partitioned into blocks each of which fits into the
primary memory, these blocks must interact in that values along their boundaries
are required in the computation using adjacent blocks. Edge values are
variables along the boundaries of these data blocks and must be passed from
one subarray to the next. For example, in a simple five point difference
scheme such as illustrated in Fig. 2.1, each point is updated in terms of
its four orthogonal mesh neighbors; d is the depth of the finite difference
operator. When the computation is operating on a row at the edge of some
block, data values from another partition are required. Figure 2.2
schematically shows the edge values of data blocks.

For  large  problems  too  much  space  would  be  required  if  all  these 
edge  values  were  stored  in  primary  memory  during  the  entire  computation.   Thus 
we  seek  a  scheme  which  periodically  writes  these  edge  values  as  well  as  the 
partitions  of  the  overall  array  into  the  secondary  storage.   The  positions 
into  which  these  data  are  written  are,  of  course,  critical.  We  wish  to  write 
them  in  such  a  way  that  when  they  are  again  required  for  computation  they  will 
be  available  with  very  small  or  no  latency. 

A critical measure is the ratio of a process' computation time on
a partition to the time it takes to bring a new partition into primary memory
from the disk. For disks which can only read or write but not both at any
one time, the theoretical lower limit of this ratio is two.
That is, if we could read a block of data and write another block of data in
precisely the time it takes to compute on a third block, the ratio would be
two and theoretically a zero latency solution would be feasible.

For disks which can both read and write at the same time this
limit is one. That is, if the compute time for a block of data is no less
than the time needed to transfer that block from the disk, a solution is feasible.
Intuitively what this means is that if the data to be calculated upon cannot
be retrieved from the disk at least as fast as they are needed by the processor,
no zero latency solutions can be found for large I/O-bound problems.
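
The two limits can be expressed as a single necessary-condition check,
sketched below. The function and parameter names are assumptions made for
illustration; the check captures only the ratio condition described above
and says nothing about block placement.

```python
def zero_latency_feasible(compute_time, transfer_time, concurrent_read_write):
    """Test the necessary ratio condition for a zero latency solution.

    compute_time:  time for the process to compute on one block.
    transfer_time: time to move one block between disk and primary memory.
    A disk that can read and write simultaneously needs a ratio of at
    least one; a disk that can do only one at a time needs at least two,
    since one block must be read and another written per block computed.
    """
    if compute_time <= 0 or transfer_time <= 0:
        raise ValueError("times must be positive")
    required_ratio = 1.0 if concurrent_read_write else 2.0
    return compute_time / transfer_time >= required_ratio


# Hypothetical block: 3 ms of computation, 2 ms of transfer.
print(zero_latency_feasible(3.0, 2.0, concurrent_read_write=True))
print(zero_latency_feasible(3.0, 2.0, concurrent_read_write=False))
```

With these assumed figures the concurrent read/write disk meets the
condition and the conventional disk does not.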

[Fig. 2.1: The mesh point being computed upon and the neighboring mesh points required for the computation.]

Data blocks are ideally so spaced apart on the disk that the time
between reading of two successive blocks is identical to the time required
for a block to be computed upon. Real world constraints, however, preclude
such a precise positioning.

While a disk is physically a homogeneous recording medium, it gets
formatted such that logically the possible storage locations are quantized.
Thus, words or bits may only be recorded at particular locations. The origin
or address of a data block on the disk is uniquely defined by its track and
angular position. The area or space on the disk occupied by a data block is
defined by its track and sector. A sector, then, is a specifically oriented
angular displacement on the disk.

Because data blocks may only start at addressable locations on the
disk, each block will start at the next addressable location following the
theoretical and unrealizable location that its compute time would call for.
This rule will, of course, introduce some latency into the solution. This
latency, as a fraction of processor running time, is equal to the incremental
amount per block, as just described, divided by the compute time per block.
This figure for most configurations can be made quite small by allowing a
reasonably large block size.
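
The size of this incremental latency can be estimated with a short
sketch. It assumes, for illustration, that addressable locations recur at
a fixed interval (called sector_time here) and that the ideal spacing
between block origins equals the block compute time; the names and sample
figures are hypothetical.

```python
import math


def quantized_block_spacing(compute_time, sector_time):
    """Return (spacing, latency_fraction) for blocks placed on sector boundaries.

    The ideal spacing between successive block origins equals the block
    compute time. Because a block may only start at an addressable
    location, the spacing is rounded up to a whole number of sectors.
    latency_fraction is the added wait per block divided by compute_time.
    """
    if compute_time <= 0 or sector_time <= 0:
        raise ValueError("times must be positive")
    sectors = math.ceil(compute_time / sector_time)
    spacing = sectors * sector_time
    incremental_latency = spacing - compute_time
    return spacing, incremental_latency / compute_time


# Hypothetical figures in microseconds: a sector boundary every 1000 us.
for compute_time in (2_500, 10_500, 100_500):
    spacing, fraction = quantized_block_spacing(compute_time, 1_000)
    print(compute_time, spacing, f"{fraction:.3%}")
```

In this example the added wait is the same 500 microseconds in each case,
so its share of the running time shrinks as the block (and hence its
compute time) grows.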

The  scheme  dealt  with  here  will  be  described  as  having  blocks 
spaced  appropriately  on  the  disk.   No  further  mention  will  be  made  in  this 
chapter  of  the  second  order  consideration  mentioned  above.   This  effect 
being  negligible,  all  solutions  and  statements  relating  to  them  will  be 
referred  to  as  zero  latency  but  must  be  recognized  as  including  this  small 
incremental  latency. 

The first iteration of the process over the data blocks (ignoring
inter-block data requirements) is particularly easy to provide for with no
latency. The data blocks are simply spaced out on the disk in one
continuous logical track. The blocks are spaced appropriately before the
program begins, and the process proceeds to work on its data.

What we have yet to provide for, however, is further iterations
over the entire mesh, or sweeps.

It is necessary to consider the data requirements of such a
program. In general an algorithm which requires successive sweeps over its
data base will generate a series of intermediate results, or updated portions
of its data base. That is, a mesh will be updated on each sweep. It is the
updated mesh which is operated upon on the next sweep.

Since  we  have  assumed  that  the  program  contains  a  single  process, 
the  spacing  between  updated  blocks  will  be  identical  to  that  of  the  blocks 
before  updating.   The  process  is  thus  considered  to  have  both  input  data  and 
output  data.   The  input  data  are  made  up  of  the  input  blocks  which  the  process 
operates  upon  directly.  These  blocks  are  input  from  the  disk  for  the  process. 
The  output  data  are  made  up  of  the  output  blocks  which  are  the  result  of  the 
updating  operations  performed  by  the  process  upon  the  input  blocks.  Each 
output  block  is  output  to  the  disk  shortly  after  the  process  finishes  with  it. 

The output blocks, when placed on the disk during the $i$-th sweep over a mesh,
are the input blocks for the $(i+1)$-th sweep.

An instantaneous description, then, of the data block memory
allocation for a system without buffer memory during the $i$-th sweep would be as
follows:

That block being computed upon is in primary memory
along with perhaps several adjacent neighboring blocks
(some already processed and some not yet processed).

The disk contains those input blocks for the $i$-th
sweep which have not yet during this sweep been moved
into primary memory.

Also on the disk are those output blocks for the
$i$-th sweep which have already been computed on and output
during this sweep. These are the input blocks for the $(i+1)$-th
sweep.

This is a steady-state condition. The time for the processor to
compute on a block is the block compute time, $C$. This is exactly the time
interval between successive input block transmissions from the disk and exactly
the time interval between successive output block transmissions to the disk.
Thus, at any point in time the disk may be servicing both input and output
requests.

Providing for proper data block spacing for a single sweep only
solves half of the latency problem. A total solution must also provide for
inter-sweep gaps. Assume that the first sweep originates as soon as the
processor receives the beginning of the requisite data and that the updated
data blocks are output to the disk shortly after being updated. The
displacement between an input block and its updated version (output block) on
the disk is therefore a function of $C$, the two transmission times between
primary and secondary (disk) memory and the amount of time the block remains idle in
primary memory. The ideal locations on the disk for these updated blocks
correspond to the length of time it takes to make a sweep. That is, the
ideal displacement, modulo the disk rotation time, is a function of $C$ and
the total number of blocks in the mesh, both of which are fixed by the
program and the processor.

Intuitively  this  means  that  if  no  special  arrangements  are  made, 
there  will  be,  in  general,  some  latency  between  sweeps.  The  amount  of  this 
inter-sweep  latency  can  vary  from  zero  to  as  much  as  an  entire  disk  rotation, 
with  the  value  distributed  uniformly  over  the  interval. 
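
The inter-sweep latency for a given block can be sketched as a modular
difference. The function name and the sample figures below are
assumptions made for illustration; transmission times are taken as zero,
in keeping with the simplification adopted in this chapter.

```python
def inter_sweep_latency(time_written, time_needed, disk_period):
    """Rotational wait for a block written at time_written and wanted at time_needed.

    Once written, the block passes under the head once per rotation, at
    times time_written + k * disk_period. The wait is the gap between
    time_needed and the first such pass at or after it.
    """
    if disk_period <= 0:
        raise ValueError("disk_period must be positive")
    return (time_written - time_needed) % disk_period


# Hypothetical sweep: 10 blocks, 7 ms of computation per block, 20 ms rotation.
# The first block is written back at 7 ms and wanted again at 70 ms.
print(inter_sweep_latency(7, 70, 20))
```

In this example the block passes the head at 67 ms and again at 87 ms, so
a processor that wants it at 70 ms waits 17 ms unless special
arrangements are made.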

This  latency  need  not  be  serious.  For  programs  which  take  many 
disk  rotations  per  sweep,  an  average  latency  of  half  a  rotation  per  sweep 
could  be  insignificant.   There  is,  however,  a  straightforward  means  of 
eliminating  the  latency. 

Buffer memory may be used to pick up and store the input data when
it passes under the disk's read heads within the disk's last rotation before
the process' next sweep begins. That is, if there would normally be a half
rotation of latency, half a rotation of data would be "cycled through" the
buffer memory. Each input block would be read from the disk at the point
at which it passed under the read head and buffered for exactly half a
rotation until the processor was ready for it. This procedure is called
origin shifting since it has the effect of shifting the origin of the data
on the disk to that angular position which would yield zero latency. Note
that the origin of the data is not physically shifted on disk; what is
accomplished is the effect of having thus shifted it. The same effect can
also be obtained by using an output buffer to delay the actual transmission
of data to the disk by the amount necessary to place it correctly in
position for later retrieval with no latency.
These two processes are equivalent and both are called origin shifting,
although only the latter actually causes the data to be shifted from their
natural position on the disk.
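The timing behind input-side origin shifting can be sketched in a few lines. This is only an illustrative model, not the author's procedure: it assumes the data origin passes under the read head at times 0, T, 2T, ..., that the next sweep becomes ready at `sweep_time`, and that the processor draws data at a constant `data_rate`. The function and parameter names are hypothetical.

```python
def origin_shift(sweep_time, rotation_time, data_rate):
    """Return (latency_without_buffer, hold_time, buffer_bits).

    All times share one unit (for example milliseconds) and data_rate is in
    bits per that unit. Assumed model: the data origin passes under the read
    head at times 0, T, 2T, ... and the next sweep is ready at sweep_time.
    """
    if rotation_time <= 0:
        raise ValueError("rotation_time must be positive")
    # Wait until the origin next comes around if nothing is buffered.
    latency = (-sweep_time) % rotation_time
    # Reading on the last pass before the sweep starts means each block
    # is held in the buffer for the remainder of that rotation.
    hold_time = sweep_time % rotation_time
    buffer_bits = data_rate * hold_time
    return latency, hold_time, buffer_bits


# Hypothetical numbers: 40 ms rotation, sweep ready at 1030 ms, 1000 bits/ms.
print(origin_shift(1030, 40, 1000))
```

In this model the hold time is always less than one rotation, which is why a rotation of data bounds the buffer.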

It is necessary to consider the size of such buffer memory. An
upper bound comes from recognizing that no more than one rotation need be
buffered. (In reality, no more than $(1 - \epsilon)$ rotation need be provided for,
since an entire rotation corresponds to that fortuitous disk layout which
produces zero latency.) One disk rotation of data, therefore, is sufficient.
There are two quantities used in this paper to express that frequently
occurring quantity:

$Q_{max}$ is the maximum amount which could be recorded
on a single rotation of the particular disk under consideration.
This is the most conservative estimate of
required buffer storage. It is simply equal to the
number of bits which could be transmitted by the disk in
a single rotation.

If the disk's bandwidth is $D$ bits/second and the
disk has a rotational period of $T_d$ seconds,

$$Q_{max} = D\,T_d$$

A more realistic quantity representing a rotation of data is obtained by
considering the actual data requirements of the processor for a particular
process. Since real programs on any given computer system will rarely require
data at the disk's maximum rate, it is often desirable to express
memory requirements in terms of the process/processor's data needs.

This quantity is denoted as $Q$ with a subscript
appropriate to the process being considered, or, in
more general situations, as $Q_{real}$.

For a block of $B_L$ bits which takes $C_a$ seconds
to compute (update),

$$Q_{real} = \frac{B_L}{C_a}\,T_d$$

Chapter 5 contains several charts which give values for $Q_{max}$ and show
relationships between $Q_{real}$ and $Q_{max}$.
It should be noted that the amount of memory necessary to eliminate
inter-sweep latency, Q, is totally independent of mesh size. Thus for larger
data bases with the same program, buffer memory does not grow. Furthermore,
$Q_{max}$ is dependent only on the characteristics of the system's disk.
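As a rough illustration of the two buffer-size measures, the sketch below computes a rotation of data at the disk's full rate and at the processor's rate. The function names and the sample figures are hypothetical; the only relationships used are the ones stated in the text (bits per rotation at the disk's bandwidth, and bits per rotation at the rate one block is consumed per compute time).

```python
def q_max(disk_bandwidth, rotation_period):
    """Bits the disk can transfer in one rotation (most conservative bound)."""
    return disk_bandwidth * rotation_period


def q_real(block_bits, compute_time, rotation_period):
    """Bits the processor actually needs during one rotation.

    block_bits / compute_time is the processor's data rate for this process.
    """
    return block_bits / compute_time * rotation_period


# Hypothetical system: 1000 bits/ms disk, 25 ms rotation,
# 8000-bit blocks that each take 10 ms to update.
print(q_max(1000, 25))         # upper bound on buffer size, in bits
print(q_real(8000, 10, 25))    # requirement at the processor's rate, in bits
```

Neither value depends on how many blocks the mesh contains, which matches the observation that buffer memory does not grow with the data base.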

If not for the problem of the intercommunication between data blocks,
the single process zero latency solution would be complete. This
intercommunication is considered here and a method is given by which it can
be handled in a manner consistent with the previous results.

For the sake of explanation, the mesh is assumed to be swept in
a left-to-right, top-to-bottom sequence as shown in Fig. 2-3, each row
containing many partitions and having a height of, in general, several mesh points.

Edge values, because they are a relatively small fraction of an
entire data block, may be lumped in groups of k into edge value blocks. The
value of the integer k can be chosen to insure that the total size of the
edge value blocks is at least a minimum I/O transmission for the disk being
used.

There  are  four  types  of  edge  values:   Left,  right,  upper  and  lower. 
They  are  defined  in  the  expected  way  and  some  are  stored  redundantly,  once 
in  the  data  block  in  which  they  naturally  occur  and  once  in  the  edge  value 
block. 

Fig. 2-2 shows edge values relative to their data blocks. Right edge
values of a particular block are those used by the partition to the right
of that block, and left edge values are similarly defined. Likewise, upper
edge values of a block are those used by the block above it.

Right and left edge values present no problem. They may be
handled without the use of edge value blocks by merely retaining in primary
memory at least part of the left and right neighbors of a block. Thus, for
any particular block, the left edge values are used
while the preceding block is being operated upon. This is because the block
containing the left edge values is in primary memory somewhat sooner than
absolutely necessary. In like fashion, the right edge values are available
to the right neighbor partition because the block containing these edge
values is retained in primary memory until its right edge values are no
longer needed.

It is the upper and lower edge values which require their own
blocks in order to be intercommunicated. These are simply output
interstitially with respect to data blocks. That is, an edge value block
(consisting of edge values for k successive blocks) is output between two data
blocks. It is simply input during the last disk rotation before it is
required. The simple rule to remember in scheduling these "mini-transmissions"
is that, for the top-to-bottom example given, lower edge values have a
periodicity corresponding to a single row across the mesh. Upper edge
values are required with periodicity corresponding to an entire sweep over
the mesh minus a row across it. Data blocks, of course, have periodicity
corresponding to exactly an entire sweep over the mesh.

Upper and lower edge values are therefore simply lumped, in a
number determined by the disk, into redundant edge value blocks. These
are output to the disk at the first inter-data block possibility and
retrieved at the last possibility before being needed by the program. Their
effect upon storage requirements is negligible and the only requirement of
this method is that there exist inter-data block gaps. These gaps will in
fact exist in all but those systems which are so matched to the programs
running on them that the data requirements are identical to the disk's
capacity.


*Some partial differential equation solutions are obtained by using the
non-updated lower edge values for each sweep, saving the updated values for the
next sweep. In such cases the updated lower edge values during the i-th
iteration are output for use during the (i+1)-th iteration. Their periodicity
is then an entire sweep over the mesh plus a row across it. During the sweep
across any row there exist two sets of lower edge values for that row: the
one being generated for use in the next iteration over the mesh and those
from the last iteration which will be used on the sweep of the next row.

As a measure of the size of the edge value blocks, consider Fig.
2-1. d is the depth of the finite difference operator. In this case d is
only one but may range up to, say, six or seven. This same depth is also
obviously the depth of the edge value "strip". Thus, if a data block is a
partition of the original mesh made up of m (height) by n (width) mesh
points, the size of an upper or lower edge value block need be no larger than
nd mesh points. Since m will always be chosen such that m >> d to minimize
the number of edge values which need be passed, the size of edge blocks will
similarly be much smaller than the data blocks.
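The size comparison above can be written as a single ratio. This is only a restatement of the argument in the same symbols m, n and d; the numbers in the closing remark are illustrative and do not come from the text.

```latex
\frac{\text{edge strip size}}{\text{data block size}}
  = \frac{n\,d}{m\,n}
  = \frac{d}{m} \ll 1
  \qquad \text{when } m \gg d .
```

For instance, with d = 1 and a partition 64 mesh points high, the upper or lower edge strip is 1/64 the size of its data block.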

It  is  also  possible  for  mesh  problems  to  be  three  dimensional  rather 
than  two  dimensional  as  previously  described.   In  such  a  case  we  think  of  the 
two  dimensional  mesh  as  described  as  being  a  plane.   In  general,  then,  a 
sweep  over  a  plane  in  the  three  dimensional  model  will  have  both  edge  values 
within  its  plane  (as  described)  as  well  as  edge  values  which  are  the  actual 
data  blocks  in  adjacent  planes.   In  this  case  the  inter-plane  edge  value 
blocks  are  identical  in  size  to  the  actual  data  block  being  processed. 

It is also possible to allow the edge blocks to become as large as
the data blocks themselves and apply the method of Chapter 5 to obtain a
zero latency solution.

2.3  Two Computational Process Programs

Two process programs are those which have two processes which
alternately sweep the entire mesh. The particular complication introduced
by generalizing to two processes is the fact that the two processes have
different compute times, C, for their respective blocks of data. Thus, for
a mesh with its data blocks properly spaced on the disk for one of the
processes, the output blocks sent back to the disk will, by the method of the
previous section, be spaced in like fashion. The next process, however,
because its compute time is different, will not be able to operate at its
no-latency tempo. It requires a different spacing of these data blocks (its
input blocks): one which corresponds to its particular speed.

It  is  obvious  that  the  data  blocks  cannot  migrate  over  the  disk. 
Once  they  are  output,  their  spacing  is  fixed  and  they  will  be  input  with  the 
identical  spacing.  Furthermore,  all  the  problems  considered  in  this  paper 
are  assumed  to  be  non-core  contained,  thus  precluding  the  trivial  solution 
of  keeping  the  entire  mesh  in  primary  memory. 

The  method  of  solution  presented  here  uses  a  small  amount  of 
buffer  memory  to  constantly  alter  the  critical  spacing  of  the  data  blocks. 
The  result  is  a  no-latency  solution  in  which  the  processor  is  always  being 
supplied  its  requisite  data  at  the  proper  rate. 

The following assumptions are made concerning the nature of the
problem.

There will be several sweeps over the mesh by both
processes. This will require a steady-state solution in
which each of the two processes is assumed to both precede
and follow the other. (The result will yield a zero
latency solution in case of a single iteration by each of
the kernels.)

A sweep over the mesh is made first by one process
over all of the mesh and then by the other process over
all of the mesh.

The pattern of sweeping must be identical for both
processes. That is, the sequence of partitions of the
mesh must be identical for each process and for all sweeps.

In  describing  the  transition  between  any  two  processes,  it  will  be 
necessary  to  consider  the  relative  speeds  or  compute  times  of  the  two.  One 
process  will  be  faster  than  the  other  and  this  one  will  be  called  the  fast 
process. Similarly, the slower process is christened the slow process. The
fast  process  takes  less  compute  time/datum  than  the  slow  process.   The  two 
process  problem  is  defined  such  that  both  processes  operate  on  precisely  the 
same  data  over  their  respective  sweeps.   Thus,  as  one  would  expect,  the  slow 
process  takes  more  time  per  sweep  than  the  fast  process. 

The  data  per  sweep  being  constant  and  the  time  per  sweep  being 
different,  the  density  requirements  of  the  two  processes  are  different.   The 
fast  process  requires  data  from  the  disk  at  a  higher  rate  than  does  the  slower 
process.  This  creates  one  of  the  two  potential  latency  producing  problems. 
Density buffering is the process of altering the density of the data comprising
a mesh from the density corresponding to one process to that density
corresponding to the other process.

The second possible area in which latency could be introduced in a
two process program is virtually identical to that of the one process
problems: intersweep transitions. That is, when a process is ready to begin
computing, the data probably is not positioned to begin flowing from the
disk. To insure no latency at this point, the single process solution uses
buffer memory to temporarily store up to a rotation of this data and thus
insure its accessibility when necessary. The use of buffer memory in this
manner is again called origin shifting since the memory functionally shifts
the origin of the data stream to correspond to the processor's timing
requirements.

Both  density  buffering  and  origin  shifting  are  handled  differently 
as  a  function  of  whether  the  data  are  being  passed  from  the  fast  process  to 
the  slow  process  or  vice  versa.   These  will  be  explained  separately  for 
fast-slow and slow-fast transitions.

2.3.1  Density Buffering, fast process to slow process transition

Transition  here  refers  to  the  altering  of  the  data  spacing  on  the 
disk,  i.e.  density,  from  that  corresponding  to  the  fast  process  to  that  of  the 
slow  process.  Thus  the  transition  is  that  of  the  data  format  and  not  that  of 
the processor as it switches from one process to the other. The data
transmission, as will be explained, occurs while one of the processes is operating.

In  the  case  of  the  fast-slow  transition,  the  data  blocks  are 
assumed  to  be  stored  on  the  disk  at  the  fast  process'  density.  The  goal  is 
to input data from the disk at this rate and provide it to the processor at the
slow  process'  rate.   This  can  be  accomplished  in  several  ways,  but  it  is 
necessary  to  keep  buffer  memory  requirements  low.   The  following  solution 
is  given: 

Each rotation of the disk contains enough data blocks to supply
the processor with its slow process computation for some period of time
which is longer than a disk rotation. (Intuitively, one rotation of data
at the higher density corresponds to more than one rotation of data at the
lower density.) If the ratio of the higher to lower density is denoted r,
one rotation of data as it is stored on the disk will keep the processor
going for r rotations (r > 1) at the slow process speed.

This method assumes initiation of data transfers from the disk only
at some arbitrary origin. This is assumed to be that angular position at
which the first block of the sweep is stored on the disk. The original
transmission will start with this block and continue for an integral number
of disk revolutions. All subsequent inputs of data corresponding to this
sweep will be initiated at the same angular position (which is where the
previous transmission stopped) and continue for integral numbers of disk
revolutions.

As data blocks are being input at the higher data rate, the process
is making use of them at a lesser rate. Thus there is a net increase of data
in buffer memory at such times. Input from the disk is halted at its origin
every time the supply of unused data already in memory is sufficient to supply
the processor for at least one full disk rotation. Data, then, starts to
accumulate in buffer memory as soon as input from the disk occurs. At certain
points during the computation, however, the accumulated data is sufficient to
supply the processor, and input ceases temporarily. In this way buffer memory
is bounded while still providing the processor with data at its required rate.
Stated concisely as an algorithm, we have
Algorithm 1:

i)   Input is initiated as the origin of the data passes
under the read head of the disk just before it is required
by the processor and continues until the conditions in ii
are satisfied (at least a full disk rotation).

ii)  At successive rotations of the disk to its
angular origin (i.e. an integral number of rotations from
the origin of the data), when the amount of data in the
buffer is greater than that amount needed by the slow process
for the next entire disk rotation, input terminates.

iii) At successive rotations to the data origin, input
is initiated (or continued) when the amount of data in the
input buffer is insufficient to keep the processor supplied
for the next disk rotation.
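The three rules above can be exercised with a small simulation that looks at the input buffer only at origin crossings. It is a simplified sketch under stated assumptions, not the thesis's implementation: data is treated as a continuous stream with no end of sweep, the buffer starts empty, `alpha` is the number of bits read per rotation at the fast process' density, `beta` is the number of bits the slow process consumes per rotation, and a buffer holding exactly `beta` bits counts as sufficient. All names are hypothetical.

```python
def simulate_fast_to_slow(alpha, beta, rotations):
    """Simulate fast-to-slow density buffering at origin crossings.

    alpha: bits read from the disk in one rotation (fast process' density).
    beta:  bits consumed by the slow process in one rotation.
    Returns (levels, reads): the buffer level at each origin crossing and,
    for each rotation, whether input was active during it.
    """
    if not alpha > beta > 0:
        raise ValueError("need alpha > beta > 0")
    level = 0
    levels, reads = [], []
    for _ in range(rotations):
        levels.append(level)
        # Rules ii and iii: read during this rotation only if the buffer
        # cannot supply the processor for the whole rotation by itself.
        read = level < beta
        reads.append(read)
        level += (alpha if read else 0) - beta
    levels.append(level)
    return levels, reads


# Hypothetical densities: 4 units per rotation on disk, 3 consumed.
levels, reads = simulate_fast_to_slow(4, 3, 8)
print(levels)
print(reads)
```

Because input is faster than consumption, the buffer peaks at the end of a rotation in which input was active, so checking the levels at origin crossings is enough to see that they stay below `alpha`.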

The bound on buffer memory is obtained by considering the worst case. Let the
amounts of data in bits required by the fast and slow processes for the
time corresponding to a revolution of the disk be $\alpha$ and $\beta$, respectively.
Thus $\alpha = \beta r$.

From the description of the fast-slow density buffering, if we are
to allow for any value of r > 1 and arbitrarily large data entities, it is
obvious that buffer memory must be at least as large as $\beta$ bits. The only
time at which input will cease is when the origin of the disk is passed over
by the read head(s) and there are at least $\beta$ bits in buffer memory.

The best case is when exactly the requisite $\beta$ bits are resident at
the time the origin is passed over. The worst case, then, is when there are
$\beta - \epsilon$ bits, thus necessitating an input transmission. At the end of another
disk rotation, $\alpha$ bits have been input and $\beta$ bits have been "used" by the
processor, leaving $\alpha - \epsilon$ bits in buffer memory. This being sufficient to
supply the processor for at least the next disk rotation, transmission stops.
Input resumes only when the contents of buffer memory fall below $\beta$ bits again.
By this algorithm, then, the amount of buffer memory required for density
buffering is never larger than $\alpha$ bits. The quantity $\alpha$, however, is identical
to Q for the fast process, that is, a rotation of data at the fast process' rate.

This process can be illustrated by a simple example. In Fig. 2-4,
the curve labelled "data from disk" is the cumulative number of bits input
to buffer memory from time 0. Note that transmissions initiate and terminate
at integral numbers of disk rotations from the origin.

The line below the axis is the integral of the amount of data
transmitted to the processor and hence, data from buffer memory. The dotted line
represents the amount of data resident in the input buffer and is simply the
sum of the other two curves.

In the example diagrammed, the scale is such that $\alpha = \frac{4}{3}\beta$. That is,
the fast process uses data at $\frac{4}{3}$ the rate of the slow process or, in other
terms, the fast process operates on the same data in $\frac{3}{4}$ the time of the slow
process.

Theorem 1:

For the algorithm described for density buffering of data for the
fast-slow transition, no more than Q of buffer memory is required.

Proof

Data transmissions, when initiated, will read or write data over an
integral number of disk rotations, origin to origin. Furthermore, data will
always be flowing to the processor more slowly than it is input to the input
buffer because $\alpha > \beta$. Therefore, the maximum amount of data that will have
to be accommodated by buffer memory will be that occurring at exactly the
end of a transmission.

Assume that this quantity is larger than $\alpha$, say $\alpha + x$. Then, by the
algorithm, one rotation earlier, $\alpha$ fewer bits would have been input and $\beta$ fewer
bits would have been forwarded to the processor. That is, one rotation
before the quantity $\alpha + x$ was in buffer memory, the quantity $\beta + x$ would have to
have been in buffer memory. But this is a contradiction because the
algorithm calls for input to terminate should the buffer contain this much
data at this point.

Q.E.D.

2.3.2  Density  Buffering,  slow  process  to  fast  process  transition 

Here data blocks are accumulated in an output buffer while the slow
process is processing more data. Roughly speaking, when the amount of such
accumulated data is sufficient for a disk rotation at the fast process' density,
an output is initiated.

Specifically,  the  algorithm  is  as  follows: 

Algorithm  2: 

When the slow process begins its computation, its output
data are accumulated in the output buffer for the first disk
rotation.

At the origin on the disk of the second and subsequent
rotations:

i)   If the amount of data already accumulated in the output
buffer plus the amount to be generated in the next rotation
($\beta$) is greater than or equal to $\alpha$, output of this data is
initiated now and continues for an entire rotation.

ii)  If the amount of accumulated data is less than $\alpha - \beta$, continue
to accumulate data for another rotation.

Thus the output buffer allows data accumulation until a full high
density output transmission can occur. Initiating the output with accumulated
data of $\alpha - \beta$, rather than simply $\alpha$, corresponds to a one rotation "look-ahead".
That is, by knowing that at all times during the rotation sufficient data will
be available to continue transmitting to the disk at the higher density, we
can initiate the output at the start of that rotation.
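The output-side rule, including the one-rotation look-ahead, can be sketched the same way as the input side. This is an illustrative model under assumptions that are not in the text: data is a continuous stream, the buffer is sampled only at origin crossings, `alpha` is the number of bits written per rotation at the fast process' density, and `beta` is the number of bits the slow process produces per rotation. The names are hypothetical.

```python
def simulate_slow_to_fast(alpha, beta, rotations):
    """Simulate slow-to-fast density buffering at origin crossings.

    alpha: bits written to disk in one rotation (fast process' density).
    beta:  bits generated by the slow process in one rotation.
    Returns (levels, writes): the output buffer level at each origin crossing
    from the start of the second rotation on, and whether each subsequent
    rotation was an output rotation.
    """
    if not alpha > beta > 0:
        raise ValueError("need alpha > beta > 0")
    # First rotation: the slow process only accumulates output.
    level = beta
    levels, writes = [level], []
    for _ in range(rotations - 1):
        # Look-ahead: what is held now plus what will be produced during
        # this rotation must cover a full rotation of output.
        write = level + beta >= alpha
        writes.append(write)
        level += beta - (alpha if write else 0)
        levels.append(level)
    return levels, writes


# Hypothetical densities: 3 units produced per rotation, 4 written.
levels, writes = simulate_slow_to_fast(4, 3, 6)
print(levels)
print(writes)
```

During an output rotation the buffer drains faster than it fills, so its largest value occurs just before output starts, and that value stays below `alpha`.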

Theorem  2: 

For the algorithm described for density buffering of data for the
slow-fast transition, no more than $\alpha$ bits of buffer memory is required.

Proof

The maximum amount of data ever held in the output buffer occurs
immediately before an output transmission. Suppose at some time an amount
of data $d > \alpha$ is resident in the output buffer, say $d = \alpha + x$. Then a
rotation before, there was $\alpha + x - \beta$ in the output buffer. But this contradicts
the stated algorithm because this amount of data would have triggered an
output transmission at that time.

Q.E.D. 

Note that $\alpha$ may be as large as Q, that is, a rotation of data. Fig. 2-5 shows
the data movement for this algorithm for the same parameters as in Fig. 2-4,
$\alpha = \frac{4}{3}\beta$.

2.3.3  Origin Shifting with Density Buffering

All density buffering takes place while the slow process is computing.
Intuitively this is because density buffering between any two densities will
always take as long as the slower process. If this procedure is to be
completely masked from the processor, it must be concurrent with the slower
process' execution. This requires a total buffer memory of 2Q for density
buffering, because each of the buffers may be as large as Q. It might be
thought of as configured as in Fig. 2-6. But this does not account for origin
shifting to eliminate intersweep latency. In general, data will not be at
exactly the angular position to be read at the right instant for the beginning
of a sweep. There is, however, a scheme permitting buffering to eliminate
this type of latency without requiring any more memory than that already
necessary for density buffering.

Observe that origin shifting may be accomplished "in advance" by
storing data in an output buffer until it may be placed at precisely that
point which corresponds to the angular position at which it can ultimately be
read for no latency. That is, data may be made to appear to have shifted on
the disk by either using an input buffer to read it early and hold it, or by
using an output buffer to actually shift it. The amount of buffer memory is
identical for either method: a rotation of the disk, or Q.

The scheme for density buffering does not permit origin shifting
during the slow process' execution without additional buffer memory. It
does, however, permit origin shifting during the fast process' execution. In
fact, since no density buffering occurs at this time, origin shifting may
occur in both the input and output buffers. The input buffer will shift the
partitions of data as they come off the disk and eliminate latency for the
fast process. The output buffer will place the partitions in the proper
position on the disk for the slow process. Thus both origin shifting and
density buffering may be accomplished with only 2Q of buffer memory.

2.4  Generalization to N-Processes

One  and  two  process  mesh  sweeping  may  be  graphically  represented  as 
in Fig. 2-7, a and b, respectively. Such illustrations are called code graphs.
The  nodes  represent  processes  and  the  directed  links  correspond  to  data 
transmissions.   In  the  particular  case  of  the  mesh  problems  already  considered 
in  this  chapter  the  links  also  correspond  to  the  order  of  execution.   This 
need  not,  however,  always  be  true.   The  "upward  arrows"  are  necessary  because 
of  the  iterative  nature  of  the  problem. 

Fig. 2-7c shows an n-process mesh problem. This is simply one in
which several processes operate on a common data base in succession. Each is
assumed to take a different time to compute than the other processes. Thus
a zero latency solution must here perform both density buffering and origin
shifting.

The solution to this is a simple generalization of the two-process
solution. If the solution for the two-process problem is considered as
consisting of two transition components, it can be stated as follows:

Fast/slow process transitions have origin shifting
taking place in the output buffer during the fast process
execution. Density buffering occurs in the input buffer
while the slow process is executing.

Slow/fast process transitions perform the density
buffering in the output buffer during the slow process
execution. Origin shifting is accomplished in the input
buffer during the fast process execution.

N-process problems, then, are simply like two-process problems in
which there is some sequence of these slow/fast and fast/slow transitions. The
only  significant  difference  is  that  in  an  n-process  problem,  any  particular 
process  need  not  be  simply  fast  or  slow.   The  attribute  of  being  either  fast 
or  slow  is  a  function  of  the  process'  neighbor.   Since  in  the  n-process 
problem  a  process  has  two  neighbors,  it  is  not  possible  to  characterize  that 
process'  speed  as  one  or  the  other. 

If a process is faster than its predecessor process, then it is
"fast" with respect to that predecessor and the transition from its
predecessor is a slow/fast one. If a process is faster than its successor process,
it is "fast" with respect to its successor and the transition to its successor
is a fast/slow one. A process may, however, be fast with respect to its
predecessor and slow with respect to its successor, or vice versa. The
important point is that an interprocess transition will occur in one of two ways,
both of which are provided for in the solution.
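The neighbor-relative labelling can be illustrated with a short sketch that walks a cyclic sequence of processes and names each transition. The representation (a list of per-block compute times in execution order, wrapping around for the iteration) and the function name are assumptions made for illustration; equal neighboring compute times are rejected because the text assumes they differ.

```python
def classify_transitions(compute_times):
    """Label each transition in a cyclic n-process sweep.

    compute_times[i] is the per-block compute time of process i; a smaller
    value means a faster process. Returns a list of
    (from_index, to_index, kind) with kind "fast/slow" or "slow/fast".
    """
    n = len(compute_times)
    transitions = []
    for i in range(n):
        j = (i + 1) % n  # the last process hands the mesh back to the first
        a, b = compute_times[i], compute_times[j]
        if a == b:
            raise ValueError("neighboring processes are assumed to differ in speed")
        kind = "fast/slow" if a < b else "slow/fast"
        transitions.append((i, j, kind))
    return transitions


# Hypothetical three-process program with compute times 2, 5 and 3.
for step in classify_transitions([2, 5, 3]):
    print(step)
```

In this example process 2 is fast with respect to its predecessor but slow with respect to its successor, which is the mixed case described above.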

2.4.1  Data  Rate  Considerations  for  Some  n-Process  Problems 

The statement of n-process problems has considered only that data in
common to the processes. Some algorithms may be such that they consist of
several processes, each of which operates on different variables at each point
in the mesh. That is, there could be a set of variables over the mesh which
is common to all of the processes and other classes of variables which are
operated upon only by specific individual processes.

One obvious way to handle this case is to lump all of the variables
and handle the entire mesh as if it were common to all nodes. This would
require origin shifting and density buffering of every variable for every
process, even though some processes would not make use of all of these variables.

Sometimes, however, the overhead of carrying the "deadwood" variables
along  might  cause  the  data  rate  from  the  disk  to  be  exceeded  where  the  data 
rate  of  the  process  (which  just  considers  the  necessary  input  data)  would 
permit  a  solution.  These  situations  may  be  solved  by  a  decomposition  of 
solutions.  The  original  mesh  is  split  up  into  several  data  bases,  each  made 
up  of  different  variables. 

All  of  the  possibilities  of  such  decomposition  will  not  be  enumerated 
in  this  paper.  An  example  is  given  however  of  one  such  decomposition  for  a 
two-process  problem. 

The code graph of Fig. 2-7b is modified to that of Fig. 2-8 to show
all of the data transmissions discussed in the two process decomposition.


Fig. 2-8

A no latency solution for such a program will consist of the solution
to the two process portion plus the two one process solutions. That is, there
are three sets of data or meshes which are each provided for independently.
The nature of the solutions allows for this "interspersing" of data while
insuring correct arrival at the processor with only minor additional
considerations. One of these considerations is the proper handling of data rates
to and from the disk.

A no latency solution is feasible only when every process in a
program requires data at a slower rate than the disk can provide. Very large
non-core contained arrays which can be rapidly "crunched" will have no
low latency solution if the numbers are "crunched" faster than they may be
supplied. The point with respect to mixed solutions of the kind being
considered here is that it is not sufficient that the processor/disk data rates
be compatible for each of the processes in each solution. The combined data
rate for each process over its two solutions must also be acceptable. Simply,
the sum of the processor data rates of the two solutions for each process must
be less than that of the disk.
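
The rate condition just stated can be written as a small check. The sketch
below is illustrative only: the function name, the dictionary layout and the
numeric rates are assumptions introduced here, not part of the original
formulation.

```python
def combined_rates_feasible(solution_rates, disk_rate):
    """Return True if, for every process, the sum of the data rates of
    all solutions it takes part in is strictly less than the disk rate.

    solution_rates: dict mapping a process name to the list of data
    rates (one per combined solution) that it demands of the disk.
    """
    return all(sum(rates) < disk_rate for rates in solution_rates.values())


# Hypothetical numbers: each process takes part in two solutions.
rates = {"P1": [30.0, 50.0], "P2": [40.0, 70.0]}
ok_on_fast_disk = combined_rates_feasible(rates, disk_rate=120.0)
ok_on_slow_disk = combined_rates_feasible(rates, disk_rate=100.0)
```

With these made-up figures the second disk is rejected because the two
solutions of P2 together demand more than it can deliver, even though each
solution taken alone would be acceptable.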

The  buffer  memory  requirements  obtained  for  either  the  one-  or  two- 
process  solutions  are  tied  to  these  processor  and  disk  data  rates.   The  buffer 
requirements  for  combined  solutions  are  therefore  identical  to  the  memory 
upper  bound  of  the  two  process  solution.   This  is  because  no  matter  how  many 
solutions  for  any  one  process  are  combined,  as  long  as  the  sum  of  their  data 
rates  is  acceptable,  no  more  than  2Q    bits  of  buffer  memory  is  required  to 
process  the  data  at  that  rate. 

There is one final matter to be handled in any real combined solution.
The fact that the combined data rates to the processor are less than the
disk's bound does not guarantee that all the data will be accessible by the
disk. Data could conceivably be output to the disk such that when blocks
from two different solutions are to be subsequently input, they are in the
same angular position on the disk. Unless the disk can read from several
heads/tracks at the same time, there is no way to input all the required data
when it is needed. This local overlap condition occurs in spite of the fact
that the average density over the disk's total rotation is acceptable.

One way of precluding this problem is simply to ensure that overlaying
solutions will have their data blocks interleaved on the disk. Except for
edge value blocks, all input blocks for a particular process have the same
spacing between consecutive blocks. This is true no matter how many disjoint
streams of data have been prescribed by separate solutions. Therefore any
overlapping of these streams can be corrected by shifting one entire sequence
of these blocks relative to the one it overlaps with. That is, the periodicity
of any two or more streams of overlapping data blocks is identical, and merely
ensuring that the first blocks of the respective streams interleave forces
all of the others to interleave as well. This can be done at the time the two
solutions are combined: compare all streams of input blocks in the combined
solution. Where overlaps exist, one or more of those streams may be shifted by
delaying the original output requests for the corresponding blocks by the
amount of time necessary to eliminate the overlap.
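
The shifting procedure can be sketched as follows. This is a minimal
illustration under stated assumptions: angular positions are measured in
integer sectors, every stream has the same period and block length, and the
names `overlaps` and `interleave` are hypothetical. The greedy search for the
smallest delay is one possible way of choosing the shift, not necessarily the
one intended by the text.

```python
def overlaps(offset_a, offset_b, block_len, period):
    """True if two block streams with the same period collide on the track."""
    gap = (offset_b - offset_a) % period
    return gap < block_len or (period - gap) < block_len


def interleave(offsets, block_len, period):
    """Delay streams, one at a time, until no two of them collide.

    offsets: starting sector of the first block of each stream.
    Returns the (possibly delayed) starting sectors, reduced modulo period.
    """
    placed = []
    for offset in offsets:
        delay = 0
        while any(overlaps(p, offset + delay, block_len, period) for p in placed):
            delay += 1
            if delay >= period:
                raise ValueError("streams do not fit within one period")
        placed.append((offset + delay) % period)
    return placed


# Three streams, blocks two sectors long, repeating every eight sectors.
shifted = interleave([0, 1, 0], block_len=2, period=8)
```

Because the streams share one period, fixing the relative position of their
first blocks fixes the relative position of every later block as well, which
is why only the starting offsets need to be examined.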

At most, this procedure will increase the buffer memory requirements
by the size of one block for each stream which will be delayed. This is
clearly very much smaller than the amounts of buffer memory already
conservatively required and can therefore be ignored.

Edge value blocks in combined solutions are handled very much the
same as in the one-process solution. In the former case, however, it is the
combined solution itself into which the edge values are interstitially placed.
This avoids overlap conflict in the combined solution.

2.4.2  Sweeps of Less Than a Disk Rotation in Duration

Thus  far,  an  implicit  assumption  was  made  in  the  statements  of  the 
previous  problems.   It  has  been  assumed  that  any  sweep  over  a  data  base  would 
take  at  least  as  long  as  a  single  rotation  of  the  disk.   Indeed  the  solutions 
to  these  problems  use  the  fact  that  data  may  be  output  to  the  disk  and  later 
input  for  adjacent  sweeps.   This  section  considers  the  case  in  which  a  process 
takes  less  than  the  time  of  a  disk  rotation  to  perform  a  single  sweep. 

It is important here to recall one of the first assumptions made
about all of the problems to be solved in this paper. The input values
needed by any process, on the particular processor being considered, must be
required at a rate that does not exceed the capacity of the disk to provide
this data. Intuitively, no process should crunch numbers faster than the disk
can supply them.

Sweeps  of  too  short  duration  to  make  use  of  the  disk  are  handled  in 
a  simple  straightforward  manner.   Instead  of  being  routed  from  the  output 
buffer  to  the  disk,  as  the  previous  solutions  call  for,  the  data  is  routed 
from  the  output  buffer  directly  to  the  input  buffer. 

What this means is that for sweeps of sufficiently short duration,
the required data are entirely buffer memory contained. The fact that the
size of the buffer memory is sufficient to accommodate this data follows
directly from the fact that a rotation of data is provided for in each buffer.
Short duration processes of time less than a rotation must use less than a
rotation of data for input and less than a rotation of data will be output.

Thus, the solutions given earlier in this chapter, with the above
modification, provide a means of generating no latency solutions for the
previously considered problems. This will be effective no matter what the
duration of the sweep of a process over its data base.

3.  OTHER SWEEPING PROBLEMS

3.1  Introduction 

This  chapter  considers  a  more  general  program  structure  than  the 
last  chapter.   The  generalizations  include  more  diversified  data  transmissions 
than  n-process  problems  and  sweeping  patterns  which  are  not  row  by  row  as 
above,  but  regular  sweeping  patterns  as  are  found,  for  example,  in  matrix 
operations.   In  both  cases  a  significant  alteration  of  the  previous  problem 
formulations  is  the  explicit  merging  of  more  than  one  input  data  set  to  a 
process.

In  previous  cases  only  a  single  data  entity  was  assumed  to  be  input 
to  each  process  from  other  processes.   The  problems  considered  here  permit 
several  such  entities  to  be  merged  and  input  to  the  process.  An  important 
requirement  for  this  procedure  is  that  the  combined  data  rate  of  all  sets  of 
data  to  be  input  to  a  node  does  not  exceed  the  bandwidth  of  the  disk.  This 
ensures  the  ability  to  interleave  the  data  blocks  on  a  single  logical  track 
without  overlap. 

3.2  Generalized  Data  Transmissions 

Fig.  3-1  a  shows  the  relationship  of  processes  to  data  transmissions 
for  the  problems  considered  thus  far.   Fig.  3-1  b  shows  a  more  general  type 
of  structure.   The  nodes  of  this  code  graph  correspond  to  operations  which 
occur  iteratively  to  variables  over  an  entire  data  entity.  As  such,  they 
have  the  same  meaning  as  nodes  in  the  n-process  mesh  problems:   each  node 
corresponds to a computational process. Since this paper will consider only
serial processors, code graphs of the form depicted in Fig. 3-1 b can be
distorted into that of Fig. 3-1 c. This procedure allows a structure which shows
the temporal sequence of process execution, and is not in general unique.
That is, given an arbitrary code graph, there may be several distorted ones,
each of which corresponds to an allowable sequence of execution over the
processes.

A comparison of Fig. 3-1 a and c shows that the only real difference
introduced into the problem is that of the data transmissions. There are two
methods of dealing with this new problem. One of these, in fact, involves one
further mapping which makes the structure of Fig. 3-1 b become that of the
type depicted by Fig. 3-1 a.

All of the data operated upon by all of the processes can be lumped
into one large data array. This is done such that the variables of each of
the original smaller arrays are uniformly distributed from the beginning to
the end of the new combined array. This large array can be thought of as a
single mesh. This mesh is used as if by an n-process program in which each
process has the entire array as input data. However, it is only that subset
of the large array which any particular process requires that is used by that
process. The remainder is input to buffer memory, but need not be routed to
primary memory. This unused data is recombined with the updated data from
the process (output data) and then output to the disk.

The large combined data array is thus input and subsequently output
for each process in the problem. It is constantly being density buffered and
origin shifted. One obvious drawback of this type of solution is that the
entire data base of any such problem must be streamed through buffer memory
for each process. This would require a bandwidth from disk to buffer memory
(and, of course, from buffer memory back to the disk) very much higher than
if just the necessary data had to be input. This is a problem only in those
cases where the disk to be used has a lower bandwidth than is required by
using the above method. In such cases, the data transmission must be treated
differently.

For  transmission  of  required  data  only,  the  following  observations 
are made:

The  combined  input  data  rate  to  each  process  may 
not  exceed  the  maximum  deliverable  by  the  disk.   This  is 
the  sum  of  the  data  rates  corresponding  to  each  of  the 
inputs  to  a  multi-input  node. 

Merging  these  multi-inputs  for  a  process  is  not  a 
problem.  As  long  as  the  data  rate  requirement  is  met,  the 
various  input  data  streams  may  be  interleaved  when  output 
to  the  disk  before  the  merging  retrieval. 

The  sequence  of  execution  of  the  processes  in  a 
code  graph  is  not  important.   It  is  assumed  that  any 
sequence  under  consideration  is  legal  in  the  sense  that 
any  process  which  must  supply  data  for  another  is  executed 
first.   For  any  sequence  under  consideration,  then,  the 
only  matter  this  paper  is  concerned  with  is  the  data 
transitions  between  nodes.   For  this  more  general  problem, 
however,  the  data  transmissions  need  not  be  between 
adjacently  executed  processes. 

The difficult problem therefore is that of multiply routed output
data from a process. If data are updated at some particular data rate
(corresponding to the speed of that process which does the updating), and
required by several different processes at differing data rates, conflicts may
develop. If this data were to be required at rates both faster and slower than
the original rate, that process which outputs the data can be called neither a
fast nor a slow process.

If  it  is  to  prepare  its  output  data  for  a  faster  process,  density 
buffering  must  occur  during  the  process'  execution.  If  these  same  data  are 
to  be  ready  for  a  slower  process,  it  should  be  output  at  its  generated  data 
rate  and  density  buffered  while  the  slower  process  computes.  Clearly,  both 
of  these  cannot  occur  at  the  same  time,  unless  many  redundant  images  of  many 
of  the  data  blocks  are  to  be  cluttering  up  the  disk. 

Actually there are two fairly simple methods of handling this
situation. The simpler one to explain is that of "speeding up" the data.

When a data entity is first placed on the disk, it is put there at
the highest density required by any of the set of processes which uses this
array as input data. Similarly, output data from any process are density
buffered to the highest data rate that any of the successor nodes will require.
Each process may therefore expect to have to density buffer ("slowing down")
its input data. This scheme would require an input and an output buffer, each
of size Q, to do density buffering. Intuitively, to completely eliminate
latency, an additional (input or output) buffer of size Q would also be needed
to perform origin shifting.

It will be shown in a later chapter that density buffering and
origin shifting may be accomplished concurrently in the same buffer of size
Q. For the purposes of this chapter, the amount of buffer memory required
for the above scheme is bounded by 3Q.
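
The "speeding up" rule amounts to taking a maximum over the successors of
each node in the code graph. The following sketch is only an illustration:
the graph, the rate values and the function name `output_rates` are
assumptions made for the example.

```python
def output_rates(successors, input_rate):
    """For each producing process, pick the highest input rate demanded
    by any of its successor processes ("speeding up" the data)."""
    return {
        producer: max(input_rate[c] for c in consumers)
        for producer, consumers in successors.items()
        if consumers  # a process with no successors places no demand
    }


# Hypothetical code graph: A feeds B and C, B feeds C.
successors = {"A": ["B", "C"], "B": ["C"], "C": []}
input_rate = {"A": 10.0, "B": 40.0, "C": 25.0}
rates = output_rates(successors, input_rate)
```

Here the output of A is recorded at the rate needed by B, the faster of its
two consumers, so C has to slow that data down on input.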

Another way of handling the problem of output data blocks which
are routed to several processes is to consider the entire sequence of accesses
of those blocks. An allowable sequence is assigned to the code graph. All
output data are placed on the disk at that density corresponding to the next
process which will use said data as input data. This much of the scheme is
identical to the n-process problem's solution. The difference arises in this
problem because the output data of this successor process is not, in general,
an updated version of its input data. Thus, while this process is executing,
both its output data and its input data (if used as input data for other
processes) must be in general density buffered. As long as the maximum data
transmission rate is not exceeded, this method will accomplish zero latency
scheduling with only the 2Q buffer memory as used by the n-process solution.

3.3  Non-Standard  Sweeping  Patterns 

The  schemes  of  Chapter  2  have  provided  solutions  for  processes  which 
were  assumed  to  sweep  their  respective  data  entities.  They  were  expected  to 
operate  on  the  entire  set  of  their  input  blocks  on  each  sweep.  A  large  class 
of  problems  however  does  not  meet  this  requirement.   Matrix  operations  are  one 
such  example.   In  this  section  we  will  consider  zero  latency  solutions  for 
such  problems. 

McKellar and Coffman in [11] show that savings can be effected for
this  type  of  problem  by  accessing  the  blocks  in  "good"  ways  relative  to  the 
partitioning  scheme  of  the  large  matrices.   They  are  concerned  with  efficiency 
of  storage  and  minimization  of  block  accesses,  whereas  we  are  not  concerned 
with  storage  efficiency  and  permit  any  number  of  accesses  since  they  are 
planned  so  as  not  to  introduce  any  latency. 

Consider two matrices A and B which are to be multiplied, the
resultant matrix stored as C (i.e. C <- A x B). Such operations for the purposes
of this paper are assumed to be performed for matrices very much larger than
the available primary memory. Thus, the matrices are partitioned into many
smaller arrays. One such partitioning is shown in Fig. 3-2. Such partitions
are again called blocks. Both matrices A and B have been partitioned into
blocks. Each such block is assumed to contain several elements of the original
matrix, but not so many as to preclude several such blocks from occupying
primary memory at the same time. These blocks are further assumed to be
square. That is, they contain as many columns as rows of the original matrix.
The nomenclature of these blocks is obvious from Fig. 3-2. Multiplication of
the partitioned matrices is accomplished in the standard way, i.e.

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}.$$

The problem evident in most matrix operations quickly becomes
obvious as soon as one begins to consider the initial disk layout for this
simple matrix multiply operation: matrices simply do not get operated upon
in a standard row-by-row or column-by-column sweeping fashion. Each element
of, say, the A matrix will be multiplied by (different) elements of the B
matrix many times and while there are many different possible orderings of
such operations, none will allow the two matrices to be simply swept at the
same rate and produce the desired result.

Another matter should also be considered at this point. While
considering the order in which the blocks will be operated upon, it should
be recognized that it would be desirable to have a standard matrix layout
for, say, multiplication. That is, for obvious reasons, an ordering over the
blocks which was a function of the matrix's size or of whether it was being
pre- or post-multiplied would be unsatisfactory. If, for example, partitions
had to be placed in either row or column order, depending upon whether they
were being pre- or post-multiplied, many matrices would have to be stored
redundantly or have the ordering of their blocks restructured several times
during the execution of the program in which they were used. Thus, a single
standard format is desirable.

Solutions  to  this  type  of  problem  use  the  following  basic  technique. 
When  only  one  of  several  data  entities  which  are  to  be  merged  is  required  to 
be  swept  and  the  others  will  only  have  a  few  of  their  data  blocks  used  in 
the  same  interval,  all  of  these  blocks  may  be  stored  on  the  disk  at  the 
appropriate  spacing  for  the  swept  one.  It  is  always  possible  to  retrieve 
from  the  disk  a  few  blocks  of  an  entire  array  as  long  as  this  retrieval  rate 
is  less  than  that  rate  at  which  the  data  are  recorded  on  the  disk. 

For matrix operations, there are several ways in which to accomplish
a zero latency solution. One such method for matrix multiplication
will be shown.

For the operation C <- A x B, let A and B be laid out on the disk
such that their respective blocks do not overlap and the distance, s, between
successive blocks corresponds to the time for a full matrix multiplication of
two such partitions. Both sets of partitions are in row order. Refer to
Fig. 3-3.

Clearly, the blocks cannot simply be retrieved in the order in which
they appear as pairs along the disk track. The order of retrieval is as
follows:

Pick up $A_{11}$ and retain in primary memory while
the first row of B is input and multiplied. That is,
bring in $B_{11}, B_{12}, \ldots, B_{1n}$ one at a time and
pre-multiply by $A_{11}$, forming the first partial sums of
the first row of C. A superscript will be used to
denote the partial sum. Thus $C^{1}_{11}, C^{1}_{12}, \ldots, C^{1}_{1n}$ have
been computed. This resultant data is handled in one of
two ways.

If  this  row  of  blocks  of  C  would  occupy  more  than  one 
rotation,   ,  when  placed  on  the  disk  with  the  same  spacing 
as  A  and  B  (Ps  >i),  the  C  partitions  may  be  output  as 
they  are  computed. 

If  this  row  of  C  would  not  occupy  too  much  of 
primary  memory  to  be  simply  contained  there  (that  is 
if  primary  memory  were  at  least  as  large  as  P  plus  a 
few  blocks)  these  blocks  may  simply  reside  in  primary 
memory* 


* Note that the two conditions are not mutually exclusive, leaving some room
for choice in many cases.

The operations to occur now are the multiplication
of $A_{12}$ by the first row of blocks of B. $A_{12}$ will have
been picked up from the disk sometime during the last
rotation before it was needed. The first row of
blocks of B will have similarly been origin shifted such
that its first block also is in primary memory at the
time at which multiplication is to commence. If the
first partial sum row of C was output to the disk, its
first block has likewise been input and is in primary
memory.

Note  that  at  this  point  two  operations  will  be 
occurring:   Multiplication  of  an  A  block  by  a  B  block 
and  an  addition  of  this  result  to  the  first  partial 
sum  of  the  corresponding  C  partition,  yielding  the 
second  partial  sum  of  that  C  block. 

This extra addition operation (the addition of the
previous partial sum) will cause the time between
successive operations to be longer than it was for just the
matrix multiplication. Thus input data to the processor
will be density buffered in the input buffer.

The  new  (second  partial  sum)  C  blocks  will  be 
density  buffered  in  the  output  to  the  higher  data 
rate  of  the  original  A  and  B  blocks  -  which  will 
remain  intact  on  the  disk. 

Successive rows will follow in like order:
The next block from the first row of A is multiplied
by the next row of blocks of B and added to the
respective last partial sum row of C blocks. The
A and B blocks are density buffered (slowed down)
in the input buffer. The currently generated partial
sum C blocks are, if output, density buffered (sped
up) in the output buffer. Finally the first row of
blocks of C will have been computed and placed on the
disk at a spacing equal to that of the A and B blocks.
The procedure then continues for blocks from successive
rows of the A matrix, yielding successive rows of
the C matrix.
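
The order in which blocks are used by this procedure can be sketched in a
few lines. The code below is an illustration only, under these assumptions:
blocks are small square lists of rows, indices start at zero, and the
k-th block of a row of A is paired with the k-th row of blocks of B, as the
product formula for partitioned matrices requires. The helper names are
hypothetical, and disk timing, origin shifting and density buffering are not
modelled.

```python
def block_mul(x, y):
    """Ordinary product of two square blocks given as lists of rows."""
    size = len(x)
    return [[sum(x[r][t] * y[t][c] for t in range(size)) for c in range(size)]
            for r in range(size)]


def block_add(x, y):
    """Element-wise sum of two blocks of equal shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def sweep_multiply(A, B):
    """Compute C = A x B over n-by-n grids of square blocks.

    One A block is held while a whole row of B blocks streams past;
    each pass adds one more term to the partial sums of a row of C.
    """
    n = len(A)
    C = [[None] * n for _ in range(n)]
    trace = []  # order in which (A block, B block) index pairs are used
    for i in range(n):            # successive rows of A and of C
        for k in range(n):        # next A block of row i, next row of B
            held = A[i][k]        # retained in primary memory
            for j in range(n):    # B blocks arrive one at a time
                term = block_mul(held, B[k][j])
                C[i][j] = term if k == 0 else block_add(C[i][j], term)
                trace.append(((i, k), (k, j)))
    return C, trace


# 2-by-2 grids of 1-by-1 blocks, chosen so the result is easy to check.
A = [[[[1]], [[2]]], [[[3]], [[4]]]]
B = [[[[5]], [[6]]], [[[7]], [[8]]]]
C, trace = sweep_multiply(A, B)
```

The trace shows that each row of A is read once while B is read in full for
every row of A, which is the "one matrix swept once, the other many times"
behaviour of the procedure.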

This  procedure  effectively  sweeps  one  of  the  matrices  once  and  the 
other  many  times  to  produce  the  product  matrix.   There  are  obvious  variations 
in  which  the  original  matrix  blocks  may  be  stored  by  columns  or  diagonals 
instead  of  the  original  rows.  The  important  point  is  simply  that  the  one 
matrix  whose  data  are  required  most  rapidly  (in  this  case,  B)  is  on  the  disk 
at  a  sufficiently  high  density  to  be  supplied  to  the  processor  accordingly. 
The remaining matrices may be stored at this same data rate and retrieved
at the slower rate.

Another possibility with this same scheme is to store all matrices
in like fashion, say, in rows. The density at which they are stored is that
highest density of all the operations which occur in that particular program.
If, for example, a program consisted of only the standard matrix operations,
the A and B matrices of the previous example could be stored with their
blocks spaced appropriately for matrix addition. Thus they could be added
with no latency and multiplied just as described in the example except that
the B matrix will always be density buffered on input. This would effectively
"slow down" the data from the addition rate to the multiplication rate.

3.4  Discussion

This section has given several ideas for approaches to the particular
class of problems mentioned. No algorithms were given for all programs
in this class because one would almost have to exhaustively consider
each individual program (e.g. each possible sequence of all matrix operations
in a matrix program). What we have tried to show, however, is that for any
specific program a zero latency solution can always be obtained when one of
the following conditions is satisfied:

i)  the  i/o  sequences  have  some  identifiable 
regularity,  or 

ii)  the input data bandwidth requirements of the program
are very much less than the bandwidth of the disk. This condition
will  allow  skewing  of  data  blocks  on  the  disk  such  that  there  will 
be  little  if  any  overlap  between  blocks  of  different  entities. 
That  is,  additional  disk  bandwidth  capacity  will  help  preclude 
attempts  to  retrieve  more  than  one  data  block  from  identical 
disk  sectors  at  the  same  time. 

High data rate programs which retrieve input data in non-regular sequences
may qualify for zero latency solutions as generated by the algorithms in
Chapter 4.

4.  ARBITRARY PERMUTATIONS OF BLOCKS BETWEEN SWEEPS

4.1  Definition of the Problem

This  chapter  deals  with  the  problem  evidenced  when  a  partitioned 
data  base  is  accessed  several  times  but  in  different  orders.  Specifically,  the 
blocks  remain  the  same  in  terms  of  the  variables  they  represent  but  are 
referenced  in  different  sequences  during  successive  passes  over  the  data  base. 
An  example  of  this  is  a  matrix  which  is  accessed  first  by  rows  of  blocks  and 
then  by  columns  of  blocks  or  diagonals.   The  accessing  need  not  be  as  regular 
as  rows  or  columns,  however.   This  chapter  deals  with  accessing  in  any  random 
order  for  each  pass  over  a  given  data  base. 

In effect, the solution given provides for permutations of blocks
of a data entity on each pass through that data entity. This is accomplished,
as with previous results, in a manner transparent to the processor. Judicious
use of the disk and buffer memory will provide the blocks to the primary
memory in their permuted order continuously.

This  section  will  explain  the  mechanism  for  arbitrary  permutations  only 
and  will  ignore  several  other  considerations.   For  the  purposes  of  this  section, 
then,  it  will  be  assumed  that  density  buffering  and  origin  shifting  are 
unnecessary.   Furthermore,  it  is  also  assumed  that  each  block  in  an  array 
will  be  accessed  exactly  once  during  each  pass.   Later  sections  will  deal  with 
combining  permutations  with  origin  shifting  and  density  buffering  and  with 
additions  and  deletions  of  blocks  while  they  are  being  permuted. 

The  mechanism  for  arbitrary  permutations  involves  the  following  steps: 

i)  Rectangularize  the  sequence  of  blocks  as  they  are  required  for, 
say,  the  initial  pass. 


ii)   Obtain  from  the  resultant  matrix  a  sequence  of  alternating 
row,  column,  row  permutations  which  will  yield  the  desired  new  sequence  of 
blocks. 

iii)  Linearize the above result into a linear sequence of the blocks.

iv)  Relate  the  above  directly  to  the  operating  system. 

The  rectangularization  is  a  process  whereby  the  original  sequence 
of  blocks  is  represented  in  a  manner  which  shows  what  their  positional  layout 
would  be  on  the  disk  if  they  were  to  be  directly  output.   The  horizontal 
dimension  corresponds  to  the  number  of  such  blocks  which  can  occupy  a  physical 
track  (i.e.  a  single  rotation  of  the  disk).  The  vertical  dimension,  then, 
corresponds  to  the  number  of  disk  rotations  the  entire  set  of  blocks  will 
occupy  at  their  given  density.   This  is  given  more  formally  below. 

Consider  the  original  sequence  of  data  blocks  (the  order  from  which 
they  will  be  permuted) .  Without  loss  of  generality  these  may  be  relabelled  1 
to  n  in  the  order  of  this  original  sequence. 

These  n  elements  are  placed  in  a  rectangular  array.   The  width  of 
this  array,  or  horizontal  dimension,  h,  is  the  number  of  blocks  which  will 
occupy  one  physical  rotation  of  the  disk  at  the  required  input  data  rate  of 
the  process  which  operates  on  it.   In  general,  the  timing  may  be  such  that 
there is a non-integral number of partitions corresponding to an exact disk
rotation. In such cases, that integer just less than the non-integral number
of partitions which fills a disk rotation will be used. This introduces a
small amount of latency into the solution. This latency is usually negligible,
however.

The height or vertical dimension of the matrix, v, corresponds to
the number of rotations that all of the n blocks will occupy on the disk.
That is, v is the smallest integer not less than n/h:

$$v = \lceil n/h \rceil.$$

This matrix is shown schematically in Fig. 4-1.


Fig. 4-1

Note  that  in  general  n  <  vh  and  the  matrix  has  been  "packed"  with  the  numbers 
n+1  to  vh  even  though  there  are  no  such  blocks.  Each  entry  in  such  a  matrix 
corresponds  to  a  particular  block  and  will  frequently  be  referred  to  as  an 
element  of  the  matrix. 
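
The rectangularization, including the padding of the last row, can be
sketched as below. The function name `rectangularize` and the use of nested
lists to hold the matrix are assumptions made for illustration.

```python
def rectangularize(n, h):
    """Lay blocks 1..n out in rows of h blocks, one row per disk rotation.

    Returns (v, matrix). Positions past block n are padded with the
    labels n+1 .. v*h, which correspond to no real block.
    """
    v = (n + h - 1) // h  # smallest integer not less than n / h
    matrix = [[row * h + col + 1 for col in range(h)] for row in range(v)]
    return v, matrix


# 18 blocks at 5 blocks per rotation: the labels 19 and 20 are padding.
v, original = rectangularize(18, 5)
```

Each inner list is one rotation of the disk, so reading the matrix row by
row reproduces the order in which the blocks would pass under the heads.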

This matrix is, quite literally, a map of the sequence of blocks
on the disk if they had been directly output to the disk. That is,
each row represents a rotation of data blocks around the disk. The entire
matrix represents the entire sequence placed on the disk with respect to some
arbitrary origin. This matrix will be referred to as the original matrix
and is the rectangularization of the sequence of blocks.

The sequences of permutations over the rows and columns of the
original matrix are obtained by first forming the resultant matrix. The
resultant matrix is obtained in a manner similar to that of the original matrix.
The final permuted sequence of blocks is rectangularized as if going
backwards. That is, a matrix representation of the n partitions is obtained
which gives the positions the blocks would be in on the disk if they were
already in permuted order.

An example is given to aid in the explanation of the algorithm
which yields the desired transformations. For an original 20 blocks labelled
1-20 in the order they were computed upon on the original pass, assume they
are next needed in the sequence 8, 12, 20, 5, 2, 13, 18, 19, 14, 6, 10, 15,
17, 16, 4, 3, 9, 7, 11, 1. For h = 5, v = 4 the original and resultant
matrices are obtained trivially and shown in Fig. 4-2.

The result will be a set of row permutations followed
by  a  set  of  column  permutations  followed  by  yet  another  set  of  row  permutations. 
These  permutations  will  be  such  that  when  successively  applied  to  the  original 
matrix  they  will  produce  the  resultant  matrix. 

A set of row permutations consists of v permutations of h elements
each. That is, each row is permuted independently and according to the
particular pattern determined for it. Similarly, a set of column permutations
consists of h permutations of v elements each, where each column permutation
is independent of the others.
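
The rectangularization described above can be sketched in a few lines. This is only an illustration, not part of the original method: the helper name `rectangularize`, the use of Python, and the zero-based row indexing are assumptions, and the block sequence is the 20-block example as read from the text.

```python
import math

def rectangularize(sequence, h):
    """Lay a block sequence out row by row, h blocks per disk rotation.

    When n is not a multiple of h, the last row is packed with the dummy
    labels n+1 .. v*h, as described for the original matrix.
    """
    n = len(sequence)
    v = math.ceil(n / h)                       # number of rotations (rows)
    padded = list(sequence) + list(range(n + 1, v * h + 1))
    return [padded[r * h:(r + 1) * h] for r in range(v)]

# Order in which the blocks are needed on the next pass (example sequence).
needed = [8, 12, 20, 5, 2, 13, 18, 19, 14, 6,
          10, 15, 17, 16, 4, 3, 9, 7, 11, 1]

original = rectangularize(range(1, 21), h=5)   # order of the original pass
resultant = rectangularize(needed, h=5)        # order of the next pass

for row in resultant:
    print(row)
```

Each printed row is one rotation of the resultant matrix; the original matrix is simply the blocks 1 to 20 in four rows of five.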

4.2  The Algorithms for Arbitrary Permutations

One  important  point  should  be  kept  in  mind  during  the  following 
discussion.  A  row  permutation  is  in  fact  a  switching  between  columns.   That 
is,  when  a  row  of  elements  is  permuted,  the  elements  change  columns.   Similarly, 
column  permutations  cause  row  changes  to  be  effected. 

Basically the sequence of row-column-row permutations which yields
the total permutation over the entire matrix is obtained by first finding
the middle step - the column permutations. Since the sets of row-column-row
permutations are not in general unique, any satisfactory set of column
permutations is generated first. The row permutations then are generated by working
"outward" from the set of column permutations just obtained.

Intuitively,  the  set  of  column  permutations  swaps  the  elements  from 
their  original  rows  into  the  rows  they  will  occupy  in  the  resultant  matrix. 
The  set  of  row  permutations  which  precedes  this  aligns  the  elements  so  that 
there  will  be  no  conflicts  when  they  are  row  swapped.   The  final  set  of  row 
permutations  merely  shifts  the  elements  into  their  positions  which  make  up 
the resultant matrix.

The algorithm which describes this process is now given, performing
the operations on the example given earlier.

Standard  Permutation  Algorithm: 

I.  Construct a transition matrix which shows the transitions that each block
makes in going from the original to the resultant matrix. This matrix will
be quite large: hv by hv. In the standard fashion, we use a 1 to indicate
a transition and 0 or blank otherwise. This matrix is the Total Transition
Matrix or TTM and is shown in Fig. 4-3.

II.  Abstract from the TTM a matrix which just shows the necessary transitions
between rows of the original and resultant matrices.

This is accomplished by first partitioning the TTM into partitions
which represent the row transitions as shown in Fig. 4-4. That is, the TTM
is partitioned into h by h partitions. The partitioned TTM will always have
v^2 partitions of h^2 of the original entries each.

This  matrix  is  then  reduced  into  the  Row  Transition  Matrix  or  RTM. 
This  step  consists  simply  of  collapsing  the  partitioned  TTM  into  a  v  x  v 
matrix,  each  entry  of  which  is  the  sum  of  the  entries  in  the  corresponding 
position  of  the  partitioned  TTM.   For  example,  the  partition  in  the  first 
row,  first  column  of  the  partitioned  TTM  is  summed  to  yield  the  element  of 
the RTM which appears in the first row and first column. The RTM for this
example is shown in Fig. 4-5. The Roman numerals I to IV are used arbitrarily
to label the rows 1 to 4 of the original/resultant matrices, respectively.

The interpretation of the RTM is as follows: The entry in the i-th
row, j-th column is the number of elements in the i-th row of the original
matrix which appear in the j-th row of the resultant matrix.

Fig. 4-5   Reduced Transition Matrix

Thus,  for  example,  the  entry  in  the  first  column,  first  row  indicates 
that  two  blocks  which  appear  in  the  first  row  of  the  original  matrix  also 
appear  (somewhere)  in  the  first  row  of  the  resultant  matrix.  These  two 
blocks  are  (from  the  TTM)  the  ones  which  move  from  position  2  to  position  5 
and from position 5 to position 4.
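
Steps I and II can be sketched as follows. This is an illustrative sketch only: the function name `transition_matrices` and the zero-based, row-major position numbering are assumptions (position 2 in the text is index 1 here), and the two matrices are those of the 20-block example.

```python
def transition_matrices(original, resultant):
    """Return (TTM, RTM) for two v-by-h matrices holding the same blocks."""
    v, h = len(original), len(original[0])
    # Row-major position (0-based) of every block in the resultant matrix.
    dest = {b: r * h + c
            for r, row in enumerate(resultant) for c, b in enumerate(row)}
    ttm = [[0] * (h * v) for _ in range(h * v)]
    rtm = [[0] * v for _ in range(v)]
    for r, row in enumerate(original):
        for c, b in enumerate(row):
            src, dst = r * h + c, dest[b]
            ttm[src][dst] = 1                  # block moves from src to dst
            rtm[src // h][dst // h] += 1       # collapse the h-by-h partitions
    return ttm, rtm

original = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10],
            [11, 12, 13, 14, 15], [16, 17, 18, 19, 20]]
resultant = [[8, 12, 20, 5, 2], [13, 18, 19, 14, 6],
             [10, 15, 17, 16, 4], [3, 9, 7, 11, 1]]

ttm, rtm = transition_matrices(original, resultant)
for row in rtm:
    print(row)
```

The entry `rtm[0][0]` counts the two blocks discussed above, which stay in the first row while changing position within it.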

Note that the sum of each row and each column in the TTM is 1. The
sum of each row and each column of the RTM must therefore be h, or in the example, 5.

III.   Decompose the RTM into h individual transition matrices, each of which
will  correspond  to  a  single  column  partition. 

A  permutation  of  k  elements  can  be  represented  by  a  k  x  k  matrix 
in  which  the  sum  of  each  row  and  column  is  unity.   The  TTM  is  one  such  matrix. 
It  represents  a  permutation  of  hv  elements. 

The existence of this decomposition is shown by Ryser [13] as
Theorem 5.3 and Hall [8] as Theorem 5.1.9. Appendix A gives the listing of
a program which performs this decomposition.
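
One possible way to carry out such a decomposition is sketched below. It is not the program of Appendix A. It assumes a standard augmenting-path bipartite matching (often attributed to Kuhn), applied h times, and the names `perfect_matching` and `decompose` are hypothetical. The input is the RTM of the 20-block example.

```python
def perfect_matching(counts):
    """Match every original row i to a distinct resultant row j with
    counts[i][j] > 0, using augmenting paths. Returns owner[j] = i."""
    v = len(counts)
    owner = [-1] * v

    def augment(i, seen):
        for j in range(v):
            if counts[i][j] > 0 and not seen[j]:
                seen[j] = True
                if owner[j] == -1 or augment(owner[j], seen):
                    owner[j] = i
                    return True
        return False

    for i in range(v):
        if not augment(i, [False] * v):
            raise ValueError("no perfect matching; are all sums equal?")
    return owner

def decompose(rtm):
    """Split a matrix whose row and column sums all equal h into h
    permutation matrices, each a candidate column transition matrix."""
    v = len(rtm)
    counts = [row[:] for row in rtm]           # work on a copy
    h = sum(counts[0])
    parts = []
    for _ in range(h):
        owner = perfect_matching(counts)
        perm = [[0] * v for _ in range(v)]
        for j, i in enumerate(owner):
            perm[i][j] = 1
            counts[i][j] -= 1                  # remove the matched transitions
        parts.append(perm)
    return parts

rtm = [[2, 0, 1, 2], [1, 1, 1, 2], [1, 2, 1, 1], [1, 2, 2, 0]]
parts = decompose(rtm)
for k, perm in enumerate(parts, start=1):
    print(k, perm)
```

Because the decomposition is not unique, the matrices printed need not coincide with any particular figure; they only have to sum to the RTM.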

For the example given, one such decomposition is shown in Fig. 4-6.

Fig. 4-6


IV.  Assign  each  of  these  component  matrices  to  a  column  to  obtain  the  column 
permutations.   The  sets  of  row  permutations  then  follow. 

These component matrices are then referred to as column transition
matrices. Each of the 5 column transition matrices in Fig. 4-6 corresponds
to a permutation of 4 elements. These may be arbitrarily assigned to the
5 columns. Thus, if the matrix in Fig. 4-6(a) corresponds to the permutation in
the first column, it indicates the following permutation: the element in the
first row remains there, the element in the second row is moved to the third
row, the element in the third row is moved to the fourth row, the element in
the fourth row is moved to the second row.

For the purposes of this example, the matrices in Fig. 4-6(a) to
(e) are assigned to columns 1 to 5, respectively.

The  remainder  of  the  process  consists  of  identifying  the  particular 
elements  from  the  original  matrix  that  correspond  to  those  being  permuted  in 
the  columns.   From  this  information  the  row  permutations  are  easily  obtained. 

The  first  set  of  row  permutations  is  obtained  by  examining  the  set 
of  column  transition  matrices  and  comparing  these  to  the  partitioned  TTM,  a 
row at a time.

In the example, the 1's appearing in the column transition matrices
are assigned to particular transitions in the partitioned TTM. For example,
the column transition matrix which corresponds to the first column permutation
has a non-zero entry in row 1, column 1. This is matched with a non-zero entry
in the partitioned TTM which is in that partition that is in the first row
and first column of blocks. In this particular case there are two possibilities:
the transition from position 2 to position 5 and the transition from position
5 to position 4. Either of these two transitions is acceptable for the matching.

Suppose  it  is  the  transition  from  position  2  to  position  5  which  is 
matched  to  this  entry  in  the  first  column  transition  matrix.   Since  this  is 
the  first  column  permutation,  the  matching  has  the  following  significance: 

The element in position 2 of the original matrix
should be (during the first set of row permutations) permuted
to column 1. The column permutations will ensure
that it gets relocated in the correct row of the resultant
matrix.

Thus the particular block under consideration gets "permuted" to
the same row it started in and the final row permutation places it in position 5.

This algorithm can be concisely given as:

For  each  non-zero  entry  in  a  particular  column 
transition  matrix,  match  it  with  a  particular  transition 
from  the  partitioned  TTM.   For  this  transition,  the 
element  will  first  be  row  permuted  from  its  position  in 
the  original  matrix  to  that  column  (in  the  same  row) 
which corresponds to the particular column for the
column transition matrix being considered. After
being column permuted, this partition is then row
permuted into its position in the resultant matrix.

This  procedure  is  followed  for  the  remainder  of  the  example.  Fig. 
4-7  shows  the  column  permutation  matrices  in  which  the  numbers  of  the  blocks 
with which their non-zero entries are matched have been indicated.

Fig. 4-7

Since the matrices (a) to (e) shown in Fig. 4-7 correspond to
columns 1-5, the original matrix after the first set of row permutations is
immediately obtained. This is shown in Fig. 4-8(a).

Fig. 4-8(b) shows the matrix of Fig. 4-8(a) after being column
permuted. Fig. 4-8(c) is the resultant matrix and is obtained by row
permuting the matrix of Fig. 4-8(b).

An equivalent algorithm is given by Opferman and Tsao-Wu in [32].
They are concerned with switching signals rather than transmitting
data blocks, but the analyses are identical.

Thus any rectangular array of elements may be arbitrarily permuted by
a set of row permutations, a set of column permutations, and a final set of
row permutations.
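
The whole row-column-row procedure can be sketched end to end. The following is an illustrative sketch under stated assumptions, not the author's program: it assigns each block to a column by extracting h perfect matchings between original and resultant rows (an augmenting-path matching is assumed), and the name `three_stage` is hypothetical. The final check confirms that the three stages reproduce the resultant matrix of the example.

```python
def three_stage(original, resultant):
    """Return the matrix after the first row permutations, after the
    column permutations, and after the final row permutations."""
    v, h = len(original), len(original[0])
    dest = {b: (r, c) for r, row in enumerate(resultant)
            for c, b in enumerate(row)}
    # bucket[i][j]: blocks of original row i that end in resultant row j.
    bucket = [[[] for _ in range(v)] for _ in range(v)]
    for i, row in enumerate(original):
        for b in row:
            bucket[i][dest[b][0]].append(b)

    def matching():
        owner = [-1] * v                       # owner[j] = original row i

        def augment(i, seen):
            for j in range(v):
                if bucket[i][j] and not seen[j]:
                    seen[j] = True
                    if owner[j] == -1 or augment(owner[j], seen):
                        owner[j] = i
                        return True
            return False

        for i in range(v):
            if not augment(i, [False] * v):
                raise ValueError("no perfect matching found")
        return owner

    # First row permutations: one matching per column fixes which block
    # of each row is moved into that column.
    after_rows = [[None] * h for _ in range(v)]
    for col in range(h):
        for j, i in enumerate(matching()):
            after_rows[i][col] = bucket[i][j].pop()
    # Column permutations: every block moves to its resultant row.
    after_cols = [[None] * h for _ in range(v)]
    for i in range(v):
        for col in range(h):
            b = after_rows[i][col]
            after_cols[dest[b][0]][col] = b
    # Final row permutations: every block moves to its resultant column.
    final = [sorted(row, key=lambda b: dest[b][1]) for row in after_cols]
    return after_rows, after_cols, final

original = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10],
            [11, 12, 13, 14, 15], [16, 17, 18, 19, 20]]
resultant = [[8, 12, 20, 5, 2], [13, 18, 19, 14, 6],
             [10, 15, 17, 16, 4], [3, 9, 7, 11, 1]]

after_rows, after_cols, final = three_stage(original, resultant)
print(final == resultant)
```

The intermediate matrices depend on which matchings happen to be found, so they may differ from a hand-worked decomposition while still yielding the same resultant matrix.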

4.3  Permuting in a Memory Hierarchy

The preceding method is implemented for a head-per-track disk system
by  performing  row  permutations  in  intermediate  (bulk  storage)  random  access 
memory.   The  first  row  permutation  is  accomplished  by  accumulating  blocks 
in  an  output  buffer  as  they  come  from  the  primary  memory.  When  a  row  is 
accumulated,  these  blocks  start  to  be  output  to  the  disk  in  their  permuted 
order.   The  final  row  permutations  are  performed  in  a  similar  manner.   One 
row  is  accumulated  in  an  input  buffer  before  being  routed  to  primary  memory. 
The  blocks  of  this  row  are  then  sent  to  primary  memory  in  their  permuted  order. 

Column permutations must take place between the time the blocks leave
the output buffer for the disk and the time they arrive at the input buffer
from the disk. They are accomplished by simply switching the heads of the
disk appropriately. Fig. 4-8(a) shows a map of the allocation of the blocks
on the disk. Since a column represents a particular angular area on a disk,
the disk may read any of the blocks in a particular column with equal ease.
Thus the sequence shown in Fig. 4-8(b) can be obtained by switching the read
heads of the disk. Note, however, that the effect of column permutations
may also be obtained by switching the write heads of the disk. Thus either
the read or the write heads may be switched to achieve the effect of column
permutations, whichever would require the least switching time on a
particular disk.
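
Head switching on the read side can be pictured with a small sketch. The model is an assumption made for illustration: each stored row is taken to lie on its own track, each column is one angular sector, and the hypothetical function `head_schedule` lists, for every read rotation and sector, which stored row's head must be selected. The 2-by-3 matrices are made-up data.

```python
def head_schedule(stored, wanted):
    """For read rotation r and sector c, return the index of the stored
    row (track) holding the block wanted[r][c]."""
    v, h = len(stored), len(stored[0])
    schedule = []
    for r in range(v):
        row = []
        for c in range(h):
            # Only the v blocks stored at sector c can be read at sector c.
            tracks = [t for t in range(v) if stored[t][c] == wanted[r][c]]
            if not tracks:
                raise ValueError("block changes sector: not a column permutation")
            row.append(tracks[0])
        schedule.append(row)
    return schedule

stored = [[1, 2, 3],
          [4, 5, 6]]          # layout written to the disk
wanted = [[4, 2, 6],
          [1, 5, 3]]          # layout after the column permutations

print(head_schedule(stored, wanted))   # [[1, 0, 1], [0, 1, 0]]
```

Switching the write heads instead would amount to applying the same table when the blocks are stored rather than when they are retrieved.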

The  amount  of  buffer  memory  required  is  simply  that  for  a  row  of 
blocks  in  the  input  buffer  and  an  equal  amount  for  the  output  buffer.   Since 
a  row  of  blocks  was  previously  defined  to  be  a  disk  rotation,  the  required 
buffer  memory  for  this  scheme  is  2Q. 

4.4  Special Cases of Permutations - Addition and Deletion of Blocks

The  previous  sections  have  dealt  only  with  permutations  over  the 
entire  data  base.  All  partitions  over  that  base  were  assumed  to  be  computed 
upon  for  each  iteration.   Real  programs  frequently  do  not  access  their  data 
in that fashion. Some blocks may be abandoned after several iterations and/or
some may first appear in the computation only after some number of iterations.
This  section  deals  with  such  modifications  to  the  permutation  algorithm. 

The  deletion  of  blocks  after  having  been  used  for  several  iterations 
is  a  relatively  straightforward  procedure: 

Deletion  Algorithm: 

Assume that blocks are to be subsequently output
(otherwise they are simply overlaid in primary
memory). The resultant permutation matrix for the
first iteration in which these blocks do not appear is
generated to show these blocks at the end of all those
blocks which will be used. That is, the blocks to be
deleted occur in any sequence after the sequence of
blocks which are not deleted. The permutation algorithm
is used to obtain the row-column-row permutations
as before, except that the deleted blocks are not picked
up after the others. Thus, they are "shifted" (permuted)
to the end of the line and then truncated from the
sequence and left on the disk.
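
The construction of the resultant sequence for a deletion can be sketched as below. The function name `resultant_with_deletions` and the eight-block data are hypothetical, chosen only to illustrate the rule that deleted blocks follow all blocks still in use.

```python
def resultant_with_deletions(next_sequence, current_blocks):
    """Return (sequence, live): the blocks still needed, in order of next
    use, followed by the deleted blocks; only the first `live` entries are
    picked up on the next iteration."""
    keep = set(next_sequence)
    deleted = [b for b in current_blocks if b not in keep]   # any order will do
    return list(next_sequence) + deleted, len(next_sequence)

current = [1, 2, 3, 4, 5, 6, 7, 8]      # blocks used on this iteration
needed_next = [3, 1, 7, 5, 2]           # blocks 4, 6 and 8 are abandoned

sequence, live = resultant_with_deletions(needed_next, current)
print(sequence, live)                   # [3, 1, 7, 5, 2, 4, 6, 8] 5
```

The full sequence is what would be rectangularized into the resultant matrix; the trailing entries are the blocks left on the disk.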

Addition  of  blocks  at  iterations  other  than  the  first  is  accomplished 
in  much  the  same  way.   The  blocks  to  be  added  are  placed  on  the  disk  at  the 
locations  from  which  they  will  be  read  for  the  first  iteration  in  which  they 
appear.   These  locations  are  easily  obtained  from  the  standard  permutation 
algorithm. 

Addition  Algorithm: 

All  blocks  which  will  later  be  added  to  the  set  being 
permuted  are  entered  in  the  set  notationally  by  appending 
them  at  the  end  of  the  sequence  of  blocks  which  are  being 
used.   This  is  done  for  the  last  iteration  before  the  new 
blocks  are  to  be  used  and  they  thus  appear  in  the  original 
permutation  matrix.   They  are  placed  in  the  resultant 
permutation  matrix  at  those  places  at  which  they  are  required 
by  the  processor.   The  permutation  algorithm  will  generate 
the usual row-column-row permutations for all the blocks.
The first row permutations are not carried out for the
blocks which will be added, however. They are pre-placed
on the disk at those locations at which the permutation algorithm
would have placed them had they been computed upon
during the previous iteration.

Thus blocks may be added and deleted at any time during the iterative
computations over the data base. As explained in the next section, however,
blocks may not be deleted and later added by application of the deletion
and addition algorithms.

4.4.1  Other Special Cases

When  a  block  is  deleted  it  is  placed  by  the  permutation  algorithm 
at some particular angular position on the disk. This position may be anywhere
insofar as the computational process is concerned because it is finished
with  this  block.   The  permutation  algorithm,  however,  puts  it  in  one  of  a 
limited  set  of  positions  which  are  consistent  with  its  goal  of  performing  the 
necessary  permutations. 

Blocks  to  be  added,  on  the  other  hand,  must  be  placed  in  particular 
angular positions on the disk before being input. In general, there is no way
to ensure that a block be abandoned at precisely that position from which it
should be retrieved for subsequent iterations over the data base. In fact,
cases  can  be  constructed  which  preclude  this  fortuitous  placing  of  blocks  on 
the  disk.  Thus  the  procedures  described  above  for  additions  and  deletions  do 
not  permit  the  addition  of  blocks  which  have  already  been  used  and  deleted. 

There are, however, several ways in which this situation may be handled.
For  processes  which  do  not  require  data  at  the  disk's  maximum  rate,  the  following 
methods  are  feasible.   (Processes  which  require  the  disk  to  be  constantly 
transmitting  data  must  have  additions  and  deletions  of  their  blocks  adhere 
to  the  constraint  implicit  in  the  section  on  additions  and  deletions.   That 
is,  they  may  not  delete  a  particular  block  and  then  add  that  same  block  several 
iterations  later.) 

When  the  process  requires  input  at  a  lower  rate  than  the  disk's 
maximum capacity, data can be made to appear to "move" on the disk. This is
accomplished by reading from the disk some data which are not required by the
process  for  some  time  to  come.   These  data  are  kept  in  buffer  memory  and 
subsequently  output  to  their  final  destination  on  the  disk,  without  having 
been  processed.   The  input  and  output  of  these  "moving"  data  are  performed 
interstitially  with  the  regular  input  and  output  of  blocks  for  the  current 
process. 

Another  way  to  handle  the  placing  of  deleted  blocks  so  that  they 
may  subsequently  be  added  is  to  simply  keep  the  deleted  blocks  in  buffer 
memory  until  the  disk  rotates  to  those  positions  corresponding  to  allowable 
locations  for  subsequent  addition.   This  is  possible  because  the  buffers,  if 
matched  to  the  data  rate  of  the  disk,  will  be  partly  unused  and  hence  available 
for  temporary  storage  of  blocks. 

Yet  another  method  which  makes  use  of  the  fact  that  the  process  is 
not  demanding  data  at  the  disk's  maximum  rate  is  useful  for  certain  situations. 
Suppose  the  entire  data  base  is  made  up  of  several  variables  over,  say,  a 
mesh.   On  successive  iterations,  different  variables  will  be  required  as  input 
to  the  process  with  perhaps  some  other  variables  in  common.   That  is,  some  of 
these  variables  are  required  on  every  iteration  while  others  are  required  only 
for  selected  iterations. 

One way to handle what would otherwise be a quite messy addition
and deletion problem is to include all variables for a particular portion of the
mesh in a block. The entire block is then retrieved as an input block while
only the requisite variables are made use of (and perhaps updated). This
will work when the virtual data rate made up by lumping variables in blocks
is less than that of the disk's maximum capacity.

4.5  Combining Permutations with Origin Shifting and Density Buffering

The  permutation  algorithm  stores  blocks  on  the  disk  in  such  a  way 
that  it  no  longer  is  reasonable  to  consider  them  as  a  continuous  sequence. 
Rather,  a  row,  or  rotation,  of  these  blocks  is  input  from  several  tracks  on 
the disk. These blocks are further permuted before being used by the computational
process. The concept of origin shifting deals with a continuous
stream  of  data  (in  this  case,  blocks)  which  must  be  anticipated  so  that  its 
origin  is  available  to  the  processor  when  needed. 

Since  the  data  blocks  when  being  permuted  have  no  real  origin,  there 
is  no  reason  for  the  operation  previously  described  as  origin  shifting.   Before 
a  process  will  be  able  to  use  its  first  row  of  input  data,  the  blocks  comprising 
this  row  must  already  have  been  read  from  the  disk.   This  is  because  a  row 
permutation  must  be  performed  first.   It  makes  no  difference  at  what  angular 
position  reading  was  initiated.   The  procedure  is  simply  to  begin  transmitting 
the  blocks  from  the  disk  to  buffer  memory  one  disk  rotation  before  they  must 
be  sent  on  to  the  primary  memory.  At  the  end  of  one  rotation,  the  necessary 
blocks  will  have  been  transmitted  to  the  input  buffer,  regardless  of  the  order 
in  which  they  were  transmitted. 

The  same  lack  of  data  origin  allows  for  density  buffering  in  much 
the  same  way.   The  algorithm  for  this  is  as  follows: 

Density Buffering Algorithm (Input):

One  disk  rotation  before  a  row  of  blocks  is  required 
in  primary  memory,  transmission  is  initiated  from  the 
disk  to  the  input  buffer.   This  input  ceases  after  a  single 
rotation  and  commences  again  one  disk  rotation  before  the 
next row of blocks is required.

This will always allow for the permutation of data blocks in successive
rows and incorporates what was previously referred to as density buffering
as well as origin shifting. Note that this entire procedure is described
as taking place in the input buffer. Thus, for input, no more
buffer memory is required than for the permutation algorithm alone.
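
The timing rule of the input algorithm can be sketched as follows. The function name `input_windows`, the use of integer milliseconds, and the sample times are assumptions made for illustration; transmission times are taken as zero, as in the text.

```python
def input_windows(needed_at_ms, rotation_ms):
    """For each row, return the (start, stop) interval during which the
    disk transmits to the input buffer: one full rotation ending at the
    moment the row is required in primary memory."""
    windows = []
    for t in needed_at_ms:
        start = t - rotation_ms
        if windows and start < windows[-1][1]:
            # Rows would be needed faster than one per rotation.
            raise ValueError("process demands data faster than the disk rotates")
        windows.append((start, t))
    return windows

# Hypothetical process needing a row every 60 ms from a 40 ms disk.
print(input_windows([100, 160, 220], 40))
# [(60, 100), (120, 160), (180, 220)]
```

The 20 ms gaps between successive windows are the intervals during which input ceases, which is how the lower data rate of the process is absorbed.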

Output  of  data  is  handled  in  an  identical  fashion: 

Density Buffering Algorithm (Output):

When  a  rotation  of  data  (at  the  density  at  which  it 
is  to  be  stored  on  the  disk)  is  accumulated  in  the  output 
buffer,  it  is  output  to  the  disk.   The  output  buffer  then 
continues  to  accumulate  blocks  until  another  rotation 
is  collected. 

In  this  way  mismatches  of  data  rates  are  provided  for  along  with 
reorderings  of  the  accesses  of  the  data  blocks.   The  necessary  buffer  memory 
is still 2Q, i.e., Q each for the input and output buffers.


5.   IMPLEMENTATION  CONSIDERATIONS

5.1  Non-Zero  Data  Transmission  Times 

The  methods  used  in  this  chapter  and  indeed  those  of  the  remainder  of 
this  paper  assume  no  time  to  transmit  data  from  one  level  of  memory  to  another. 
This  assumption  is  made  simply  to  facilitate  explanation  of  the  techniques 
developed.   Taking  into  account  the  effects  of  real  transmission  times  is 
relatively  straightforward: 

Instead  of  retrieving  data  within  a  rotation  of  when 
it  is  required,  data  are  retrieved  within  a  rotation  plus 
the  total  transmission  time  between  the  disk  and  the 
processor.   That  is,  the  transmission  times  between  the 
disk  and  buffer  memory  and  between  buffer  memory  and 
primary  memory  are  added.   This  time  is  added  to  the 
time  of  a  single  disk  rotation  and  the  result  is  the 
new  "lead  time"  which  determines  how  soon  before  it  is 
needed  that  data  blocks  are  retrieved. 

This  will  always  be  a  very  small  correction  to  the 
time  relative  to  a  disk  rotation.   Nevertheless,  to  be 
totally  correct,  the  size  of  buffer  memory  should  also 
be  augmented  by  an  amount  equal  to  the  amount  of  data 
which  will  be  read  in  this  additional  increment  of  time. 
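
The correction can be written compactly. The symbols are introduced here only for illustration and do not appear in the original: T_rot is the rotation period, t_db and t_bp are the disk-to-buffer and buffer-to-primary transmission times, R is the rate at which data are read into the buffer, and ΔQ is the increase in buffer size.

```latex
T_{\text{lead}} = T_{\text{rot}} + t_{db} + t_{bp},
\qquad
\Delta Q = R \, (t_{db} + t_{bp})
```

With t_db = t_bp = 0 these reduce to the idealized case assumed elsewhere: a lead time of one rotation and no additional buffer.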

5.2  Memory  Sizes  and  Data  Rates 

An upper bound on the size of the buffer memory was determined to
be 2Q, where Q is the maximum amount of data which may be transmitted by the
disk in a single rotation. Fig. 5-1 shows some of the larger disks currently
available and the values of Q to which they correspond. Note that Fig. 5-1
includes movable-head disks as well as head-per-track disks. This is not
intended to imply that the algorithms presented in this paper are directly
applicable to movable-head disks. Rather, these entries on the graph are
meant to be indicative of current disk parameters. It is noteworthy that
essentially all of the current production disks yield a Q < 0.5 Megabit.
Furthermore, most of these are under 0.25 Megabit.

The preceding leads to the observation that a separate buffer memory
may in fact not be necessary: for 32-bit words, the "buffer" memory requirement
need only be in the neighborhood of 8K to 16K words. Thus, for a system
comprising a processor and any standard disk, the input and output buffers may
well be sections of primary memory.

Not all processor/disk systems can operate in this fashion, however.
Some configurations may contain processors which do not provide for the necessary
additional primary memory. Another possibility is a system making use
of  the  type  of  disk  which  is  currently  being  manufactured  for  the  ILLIAC  IV 
computer  by  the  Burroughs  Corporation. 

These are head-per-track disks with a period of 40 msec and a data
rate of 500 Megabits per second. Q for such a disk, then, is

$$Q = 500 \times 10^{6} \times 40 \times 10^{-3} = 20 \times 10^{6} \text{ bits}.$$

The necessary buffer memory for a system making use
of such a disk is 40 Megabits if the processor is likely to require data at the
disk's rate. Most extant processors, however, will require data at a somewhat
lesser rate. Fig. 5-2 shows just what this data rate would be for various
"fast" processors as a function of the number of operations per variable which
will occur during the computational process. The size of each buffer determined
by these data rates is termed Q.
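
The arithmetic for this disk can be checked with a short sketch. The function name `rotation_capacity_bits` is hypothetical, and integer arithmetic in bits and milliseconds is used to avoid rounding. The conversion to 32-bit words is added here only for scale.

```python
def rotation_capacity_bits(rate_bits_per_s, period_ms):
    """Q: the number of bits the disk can transmit in one rotation."""
    return rate_bits_per_s * period_ms // 1000

q = rotation_capacity_bits(500_000_000, 40)   # 500 Megabits/s, 40 msec period
buffer_bits = 2 * q                           # input buffer plus output buffer
buffer_words = buffer_bits // 32              # expressed in 32-bit words

print(q, buffer_bits, buffer_words)           # 20000000 40000000 1250000
```

A buffer of this size is far larger than the few thousand words needed for the production disks discussed earlier, which is why such a disk may call for a separate buffer memory.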

5.3  Speed of the Buffer Memory


The  required  bandwidth  of  the  buffer  memory  is  the  sum  of  the  data 
rates  along  all  paths  to  and  from  it.   Fig.  5-3  schematically  shows  the  buffer 
memory  and  its  data  paths.   Thus  the  buffer  memory  need  have  a  bandwidth  of 
exactly four times that of the disk.

Fig. 5-3


Note  that  implicit  in  Fig.  5-3  is  the  assumption  that  data  are  routed  to 
the  primary  memory  from  the  buffer  memory  in  blocks.   These  blocks  need  not, 
however,  be  identical  to  those  which  are  recorded  on  the  disk.  Although  this 
paper  does  not  mention  the  need  for  such  procedures,  the  buffer  may  be  used 
as  a  staging  area  for  reformatting  data.   If  it  could  provide  the  data  at  the 
proper  rates,  there  is  no  reason  why  it  could  not  clump  k  blocks  from  the 
disk  into  a  single  block  which  gets  routed  to  primary  memory.   Similarly, 
blocks  from  primary  memory  could  be  clumped  into  larger  blocks  for  disk  storage. 

5.4  Implementation of Row Permutations

As stated at the beginning of this paper, the algorithms described
here are predicated upon knowledge at compile or run time of what
the data accesses will be. It is a relatively straightforward procedure to
determine at that time what the input and output sequences and permutations
will be. This is done by making use of the known data block access sequences
and the timing characteristics of the processor and disk.

The  i/o  sequences  and  permutations  obtained  before  actual  execution 
may  well  take  up  more  than  the  available  primary  memory.  How,  then,  can  this 
information  be  stored  so  as  to  be  made  use  of  during  execution?  One  way  is 
as  follows. 

For  each  disk  rotation  during  execution  while  permuting  blocks,  the 
following  information  must  be  available  to  the  system. 

i)   a  sequence  listing  the  (at  most  h)  tracks  from  which  the  next 
rotation's  blocks  will  be  retrieved.   These  can  be  track  numbers  without  sector 
information,  but  if  the  entire  disk  address  is  used,  it  provides  complete 
information  about  the  next  set  of  inputs  from  the  disk  as  well  as  synchronizing 
the  actual  computation  process  with  the  disk. 

ii)   a  similar  sequence  to  the  one  above  for  output  addresses. 

iii)   two  sets  of  permutations  -  one  for  the  input  buffer  and  one  for 
the  output  buffer.   Each  of  these  is  a  permutation  of  h  elements.   Thus  each 
permutation  may  be  (wastefully)  stored  as  a  string  of  h  integer  numbers. 
Suppose  further  that  a  whole  word  is  to  be  used  for  each  such  number. 

Items i), ii) and iii) above consist of at most 4h words. h is bounded by
the number of addressable sectors on the disk, but will typically be
somewhat less. For h = 64, a block of 256 words would be necessary for each disk
rotation of data.
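
The per-rotation control information can be pictured as a small record. The class name `RotationControlBlock` and its field names are hypothetical, and one word per entry is assumed, as in the wasteful encoding described above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RotationControlBlock:
    input_tracks: List[int]    # i)   tracks to read for the next rotation
    output_tracks: List[int]   # ii)  tracks to write for the next rotation
    input_perm: List[int]      # iii) row permutation for the input buffer
    output_perm: List[int]     # iii) row permutation for the output buffer

    def words(self) -> int:
        """Storage needed at one word per entry (at most 4h)."""
        return (len(self.input_tracks) + len(self.output_tracks)
                + len(self.input_perm) + len(self.output_perm))

h = 64
identity = list(range(h))      # placeholder contents for illustration
block = RotationControlBlock(identity, identity, identity, identity)
print(block.words())           # 256
```

A rotation which retrieves fewer than h blocks would simply carry shorter lists, so 4h is an upper bound rather than a fixed size.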

A block such as this could simply be stored on the disk interstitially
with respect to the data blocks. It would appear once for each rotation of
data which will be retrieved. Note that in the case of density buffering, n
rotations of data may actually be retrieved in more than n physical disk
rotations. Thus these i/o blocks are, in general, retrieved at a frequency less
than one per disk revolution.

Each such block contains essentially an input request queue, an
output  queue  and  two  permutation  sequences.  These  sequences  may  be  assigned 
as  descriptors  or  tags  to  their  respective  blocks  as  they  enter  the  buffers  or 
they  may  be  encoded  in  tables  by  which  the  system  will  route  the  blocks  and 
thus  effect  the  permutations. 

Alternatively,  routing  may  be  accomplished  by  hardware  techniques. 
One  possible  method  is  through  the  use  of  a  Batcher  network  [2].   In  such  a 
system,  each  data  block  will  enter  an  input  or  output  buffer  with  a  tag  which 
indicates  what  its  permuted  position  should  be.   The  transmission  from  these 
buffers  will  be  through  a  Batcher  network  and  thus  automatically  perform  the 
row  permutations.   In  such  an  implementation,  each  data  block  need  only  store 
a  single  address  for  each  row  permutation  it  undergoes  -  that  one  which  indicates 
its  permuted  destination. 
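
As a rough software illustration of tag-driven routing, the sketch below generates the compare-exchange pairs of Batcher's odd-even merge sorting network for a power-of-two number of blocks and uses them to move each block to the position named by its tag. The function names and the (tag, payload) representation are assumptions made for this example; a hardware network would perform the exchanges of each stage in parallel rather than one after another.

```python
def batcher_pairs(n):
    """Compare-exchange pairs of Batcher's odd-even merge sort (n a power of two)."""
    pairs = []
    p = 1
    while p < n:
        k = p
        while k >= 1:
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    # Only compare elements that belong to the same merge group.
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        pairs.append((i + j, i + j + k))
            k //= 2
        p *= 2
    return pairs


def route_by_tag(blocks):
    """blocks: list of (destination_tag, payload). Returns the row ordered by tag."""
    row = list(blocks)
    for lo, hi in batcher_pairs(len(row)):
        if row[lo][0] > row[hi][0]:
            row[lo], row[hi] = row[hi], row[lo]
    return row


# Hypothetical row of eight blocks, each tagged with its permuted destination.
tags = [5, 2, 7, 0, 3, 6, 1, 4]
routed = route_by_tag([(t, "block-%d" % t) for t in tags])
```

Because the sequence of comparisons is fixed in advance and does not depend on the data, the only information a block must carry is its destination tag.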

Requests  for  rows  of  data  blocks  may  be  handled  by  hardware  means 
also.  At  the  time  at  which  a  new  row  must  start  to  be  input  (one  rotation 
plus  transmission  times  before  it  is  needed),  the  addresses  of  the  blocks 
comprising that row are sent to a queuer of the type manufactured by Burroughs
Corporation  for  use  in  conjunction  with  the  ILLIAC  IV  disks.   These  queuers  will 
cause  the  requests  to  be  serviced  in  the  fastest  order  possible.   In  this  case, 
then,  the  order  of  the  block  requests  may  be  relative  to  the  arbitrary  data 
origin on the disk but the blocks will start to be input immediately in a
"wrap-around" sequence. The destination tag on each block will then ensure
that  the  row  permutation  is  accomplished  correctly. 
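
The effect of such a queuer can be sketched in software as ordering the requested sector addresses by their rotational distance ahead of the current head position. This is only an illustration of the "wrap-around" servicing order; the function name, the integer sector addresses and the eight-sector track are assumptions, not a description of the Burroughs hardware.

```python
def service_order(requested, head_sector, sectors_per_track):
    """Order requested sector addresses by rotational distance ahead of the head."""
    return sorted(requested, key=lambda s: (s - head_sector) % sectors_per_track)


# Requests are stated relative to the data origin (sector 0), but the head
# happens to be at sector 4, so servicing wraps around past the origin.
order = service_order([0, 3, 5, 6], head_sector=4, sectors_per_track=8)
```

Here the row is picked up in the order 5, 6, 0, 3, and the destination tags restore the intended arrangement afterwards.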

One  obvious  observation  which  may  be  made  at  this  point  is  that  many 
permutation  patterns  will  be  cyclical.   That  is,  there  may  be  two  basic  regular 
accessing patterns which alternate over the data base. In such cases, the use
of unique i/o blocks for each rotation of data would probably be unnecessary.
A periodic pattern of accesses would probably suffice.

5.5  Disks  Which  Do  Not  Read  and  Write  Simultaneously 

The algorithms and methods presented in this paper have been predicated
upon the existence of a disk which can simultaneously read and write
(although not with the same head or on the same track). That is, data blocks
are  assumed  to  be  read  from  the  disk  according  to  a  particular  algorithm  and 
written  on  the  disk  according  to  some  other  algorithm.   Never,  however,  has 
this  paper  been  concerned  with  constraints  upon  the  times  when  reading  and 
writing  may  occur.   Indeed  it  is  probably  less  efficient  to  introduce  such 
constraints  than  to  assume  concurrent  read/write  capabilities  of  the  disk 
and  then  simulate  this  facility  on  the  standard  disk. 

The  mechanism  by  which  a  disk  can  be  made  to  appear  to  both  read 
and  write  at  the  same  time  is  relatively  straightforward.   Each  physical  track 
on  the  disk  is  divided  into  an  even  number  of  sectors.   The  size  of  these 
sectors is such that each may be addressed as a unit. For reasons which will
shortly  become  clear,  these  sectors  should  also  be  relatively  small  within 
the  limits  of  the  above  constraint. 

These  sectors  are  then  considered  as  adjacent  pairs.   Thus  from 
some  arbitrary  rotational  origin  on  the  disk,  each  physical  track  will  consist 
of  some  number  of  pairs  of  addressable  sectors.  As  the  reader  might  expect, 
each  pair  is  considered  to  be  a  read/write  unit.   That  is,  for  each  track- 
sector  pair,  one  sector  will  be  used  to  write  data  on  and  the  other  to  read 
data  from.   In  this  way  a  disk  i/o  supervisor  can  at  a  macroscopic  level  appear 
to  both  read  and  write  at  the  same  time  while  in  reality  it  is  only  reading 
(at most) half of that time and similarly writing.

Note that this method cannot assign particular track-sectors permanent
read  status  and  others  permanent  write  status.  Adjacent  pairs  of  these  units 
are,  as  a  pair,  assigned  the  duty  of  splitting  up  the  read/write  responsibility. 
This  is  because  a  track-sector  which  is  at  one  time  used  to  write  data  to  must 
(if  it  is  to  be  useful)  later  be  used  to  read  data  from. 

After  data  is  read  from  a  particular  track  sector  it  may  be  considered 
a  write  unit.   If  it  is  the  first  half  of  a  read/write  pair,  the  second  half 
will  already  be  a  write  unit  and  may  be  thus  used.   If  the  second  half  of  the 
pair  is  not  immediately  used,  either  unit  is  available  for  a  write.  As  soon 
as  one  half  of  the  pair  gets  written  on,  however,  it  becomes  the  read  portion. 
Thus  available  unused  i/o  pairs  may  be  thought  of  as  write/write  pairs. 

Write  portions  of  a  read/write  unit  may  only  be  used  (written  in) 
when  an  adjacent  input  will  read  the  contents  of  its  read  portion.   That  is, 
we  disallow  read/read  pairs.   In  this  way,  read/write  pairs  are  maintained  as 
either  read/write  pairs  (or  released  by  being  converted  to  write/write  pairs) 
although  the  assignment  of  the  read  or  write  duties  will  change  within  the 
pair. 
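
The pairing rules described above can be sketched as a small bookkeeping class. This is a minimal model under stated assumptions: sectors are numbered so that pairs are (0,1), (2,3), and so on, a sector that "holds data" is the read portion of its pair, and the class and method names are invented for the illustration.

```python
def mate(sector):
    """Other half of the adjacent pair (0,1), (2,3), ... that contains `sector`."""
    return sector ^ 1


class SectorPairs:
    def __init__(self, n_sectors):
        if n_sectors % 2:
            raise ValueError("a track must hold an even number of sectors")
        # False: write unit (free). True: read portion (holds live data).
        self.holds_data = [False] * n_sectors

    def read(self, sector):
        if not self.holds_data[sector]:
            raise ValueError("sector holds no data to read")
        self.holds_data[sector] = False  # once read, it becomes a write unit

    def write(self, sector):
        if self.holds_data[sector] or self.holds_data[mate(sector)]:
            raise ValueError("would overwrite data or create a read/read pair")
        self.holds_data[sector] = True  # once written, it becomes the read portion

    def read_then_write_mate(self, sector):
        """Service an input from `sector` and place an output in its mate."""
        self.read(sector)
        self.write(mate(sector))


pairs = SectorPairs(4)
pairs.write(0)                 # pair (0,1) changes from write/write to read/write
pairs.read_then_write_mate(0)  # the roles within the pair swap
```

After the last call, sector 1 holds data and sector 0 is free, so the pair is still a read/write pair with the duties exchanged.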

As  an  algorithm,  this  can  be  stated  as: 

The  current  write  sector  is  the  mate  to  the  current 
read  sector  (which  is  uniquely  defined  by  the  algorithm 
which  provides  input  data  to  the  processor's  memory). 

If, as happens in some cases, more output data
are generated than the input data being used, the output
destinations are assigned from sector pairs on the disk
which are not currently being used.

Note that this method of making the disk appear to both read and
write  will  impose  some  additional  restrictions.   Since,  in  reality,  data  blocks 
are  not  being  both  read  and  written  at  the  same  time,  an  additional  amount  of 
buffering  will  be  necessary.   For  example,  a  situation  in  which  blocks  are 
assumed to be directly output to the disk will, in reality, require the buffering
of an amount of data equal to that contained in one half of a read/write
pair.   Typically  this  should  be  the  size  chosen  as  a  data  block  and  an  additional 
amount  of  buffer  memory  equal  to  two  data  blocks  (one  each  for  each  buffer)  in 
size  is  added  to  the  buffer  memory  requirements.   This  additional  amount  of 
memory  is  incremental  relative  to  the  2Q  already  required. 

A  more  serious  consequence,  however,  comes  from  the  fact  that  the 
disk  now  cannot  be  read  at  its  maximum  data  rate.   Similarly  it  cannot  be 
written to at this maximum data rate either. Thus the effective data transfer
rate for a disk being used in the above fashion is one half of its
maximum transfer rate. This has the desirable effect of halving the necessary
buffer  memory  but  the  unfortunate  effect  of  precluding  zero  latency  solutions 
for  processes  which  require  data  at  rates  greater  than  the  new  effective  data 
transfer  rate. 

The  result  of  this  may  be  the  requirement  to  use  a  disk  with  a 
higher  data  transfer  rate  to  simulate  one  with  read/write  capabilities  at  a 
lesser  data  transfer  rate. 
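
The feasibility condition implied here can be written as a one-line check. The sketch assumes rates expressed in words per second and uses illustrative numbers; the function and parameter names are hypothetical.

```python
def zero_latency_possible(required_rate, max_disk_rate, simulated_read_write=True):
    """True if the process's data rate fits within the disk's effective rate."""
    effective_rate = max_disk_rate / 2 if simulated_read_write else max_disk_rate
    return required_rate <= effective_rate


# A process needing 0.6M words/sec on a 1M words/sec disk (assumed figures).
on_simulated_disk = zero_latency_possible(0.6e6, 1.0e6)
on_true_read_write_disk = zero_latency_possible(0.6e6, 1.0e6, simulated_read_write=False)
```

With these figures the process fits a disk that truly reads and writes concurrently but not the simulated one, which is the case in which a faster disk would be needed.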

One  more  consequence  of  this  simulation  is  the  relatively  minor 
effect of the decreased data storage capacity which comes about from partitioning
the disk. Each access from disk may require some unused time (= space)
for switching between heads on the disk. The read/write pairs may be chosen
to be relatively small due to buffer memory considerations. Their small size,
however, will cause more data transmission initiations and hence a small
degradation of the disk's data handling abilities.

5.6 Real Data Block Spacing and Synchronization

Throughout this paper we have relied on the predictability of execution
times of the computational processes. This has permitted the spacing
between data blocks such that their retrieval rate from the disk exactly
corresponded to their computation rate by the processor. No real processor,
however, can be relied upon to be totally predictable to the microsecond level
(which may well be called for). How, then, do such methods as those presented
here  remain  practical  for  real  processors? 

The  above  question  is  in  part  answered  by  the  fact  that  the  buffer 
memory  acts  as  a  cushion  to  allow  a  slight  drift  in  execution  time  of  a 
process.   If  the  process  takes  less  than  its  predicted  time  for  some  data 
block,  the  next  block  is  already  in  buffer  memory  when  the  first  is  completed. 

If  the  processor  consistently  takes  less  than  the  predicted  time, 
eventually  the  data  available  in  the  buffer  will  be  depleted  and  the  processor 
will  become  synchronized  to  the  new  input  as  it  becomes  available.   In  such 
a  case,  the  prediction  of  compute  time  was  probably  in  error. 

In  the  case  of  the  processor  continuously  taking  too  much  time,  the 
buffer  will  eventually  fill  to  capacity.  At  this  point  a  supervisor  program 
should  terminate  reading  from  the  disk  to  the  input  buffer  for  a  rotation. 
Again,  this  is  the  result  of  an  erroneous  prediction  of  the  computation  time. 
In  either  of  these  two  cases,  no  severe  problem  develops.   The  processor 
eventually  computes  at  the  slower  of  its  natural  rate  and  that  of  the  input 
data  available  to  it.   In  the  case  where  the  processor  computes  at  a  rate 
slower  than  it  could  if  its  input  data  were  always  available,  the  slowdown 
is  essentially  caused  by  rotational  latency. 
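
A toy simulation makes the "slower of the two rates" behavior concrete. It is only a sketch under simplifying assumptions: time advances one disk rotation at a time, the disk delivers a fixed number of blocks per rotation, and the supervisor skips input for a rotation whenever the incoming row would not fit in the buffer. All names and numbers are illustrative.

```python
def simulate(rotations, supplied_per_rotation, consumed_per_rotation, capacity):
    """Return the average number of blocks processed per rotation."""
    buffered = 0
    processed = 0
    for _ in range(rotations):
        # The supervisor skips input for this rotation if the row would not fit.
        if buffered + supplied_per_rotation <= capacity:
            buffered += supplied_per_rotation
        done = min(consumed_per_rotation, buffered)
        buffered -= done
        processed += done
    return processed / rotations


fast_processor = simulate(1000, supplied_per_rotation=10, consumed_per_rotation=12, capacity=20)
slow_processor = simulate(1000, supplied_per_rotation=10, consumed_per_rotation=8, capacity=20)
```

In the first case the processor settles at the input rate of 10 blocks per rotation, and in the second it runs at its own rate of 8 while the buffer periodically fills and input is skipped.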

Note, however, that in such a case the data rate and process are
still "closely" matched. This has the consequence of introducing much less
latency than if no preplanning had been undertaken. Thus "bad" predictions
of compute times still yield "good" results relative to no predictive planning.

5.7 Obtaining Compile-Time Information on Data Accesses

Pre-run  information  on  the  sequences  of  data  blocks  to  be  required 
by  a  program  could  be  obtained  by  modifying  an  analyzer  of  the  type  described 
in [10]. The analyzer could be modified to obtain possible sequences over a
process's input data. A sequence would be a sequential "trace" over the computations
which the analyzer currently finds may be performed in parallel.

Accurate  timing  information  may  be  obtained  at  this  same  time.   For 
each  computational  process,  the  particular  operations  per  datum  (or  per  mesh 
point  in  some  cases)  are  obtained.   (Note  that  in  some  cases  this  information 
may  be  dependent  upon  the  block  size  also,  thus  yielding  the  number  and  types 
of  operations  per  datum  as  function  of  the  block  size.)   The  compute  time 
per  block  then  comes  from  computing  the  actual  time  to  perform  the  particular 
operations.   This  should  include  any  non-processing  time  taken  for  retrieval 
from primary memory, data alignment, etc. Such timing information can obviously
also  be  obtained  empirically  by  running  a  program  or  portion  thereof  beforehand. 

5.8  The  Effects  of  Block  Size  on  Memory 

For  some  processes,  the  block  size  chosen  will  have  an  effect  on 
the  buffer  memory  and  disk  characteristics.   This  occurs  when  the  number  of 
operations  performed  upon  each  variable,  while  constant  for  an  entire  iteration 
over  the  data  base,  will  be  related  to  the  block  size.   That  is,  as  the  block 
size  is  varied  the  number  of  operations  performed  per  variable  each  time  a 
block is accessed will also vary. An example of this is matrix multiplication.

5.8.1 What the Effect Is

In multiplying two n by n matrices, a total of $n^3$ multiplications
will occur. If the blocks consist of k by k partitions of the matrices, there
will be $k^3$ multiplications for each access of a block from each matrix. There
will be $(n/k)^2$ blocks and each block will be accessed $n/k$ times. (The number of
multiplies per block accessed times the number of accesses/iteration times the
number of blocks in a matrix gives $n^3$ multiplications.)
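
The counting argument can be checked with a few lines of code. The sketch assumes that k divides n evenly; the function name is invented for the example.

```python
def multiplication_counts(n, k):
    """Return (multiplies per block access, blocks per matrix, accesses per block)."""
    if n % k != 0:
        raise ValueError("k must divide n for an even partition")
    per_access = k ** 3
    blocks = (n // k) ** 2
    accesses = n // k
    return per_access, blocks, accesses


per_access, blocks, accesses = multiplication_counts(n=8, k=2)
total = per_access * blocks * accesses  # should equal n cubed
```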


The required input bandwidth $B_w$ for this problem with the above
partitioning scheme is approximated by considering only the multiplications:

$$B_w \approx \frac{(3\ \text{blocks/multiplication}) \times (\text{size of each block})}{(\text{number of multiplies per block pair}) \times (\text{time/multiply for particular machine})}$$

Letting $t_m$ represent the processor's multiply time,

$$B_w \approx \frac{3k^2}{k^3 t_m} = \frac{3}{k t_m}$$


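If the block size, $B_L$, were increased by doubling the linear dimension
of the partition, we obtain

$$B_L = 2k \text{ by } 2k = 4k^2$$

$$(2k)^3 = 8k^3 \text{ multiplications per block}$$

$$\left(\frac{n}{2k}\right)^2 = \frac{1}{4}\left(\frac{n}{k}\right)^2 \text{ blocks per matrix}$$

$$\frac{n}{2k} \text{ accesses of each block}$$

so that

$$B_w \approx \frac{3(4k^2)}{8k^3 t_m} = \frac{3}{2k t_m}$$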

There  will  be  three  blocks  input  for  each  multiplication  when  the  partial  sum 
from  a  previous  multiplication  must  be  retrieved  and  updated  after  each  current 
multiplication.

Thus the bandwidth is halved by quadrupling the block size. This follows by
recognizing that matrix multiplication is an $N^{3/2}$ process. That is, for N
elements in each matrix (n x n) there are $N^{3/2}$ multiplications over the entire
matrix multiplication. For K (= k x k) elements per block, there are $K^{3/2}$
multiplications and the (input) bandwidth, which is measured in bits or words
per second, comes from

$$B_w = \frac{3K}{K^{3/2} t_m} = \frac{3}{K^{1/2} t_m}$$

For an $N^{3/2}$ process, then, increasing the block size by a factor of
P will result in a decrease of the necessary buffer memory bandwidth by the factor $1/P^{1/2}$.
In general, for an $N^r$ process (r > 1) an increase of block size by a factor
of P will reduce the required bandwidth by the factor $1/P^{r-1}$. (We make the
assumption here that the required output bandwidth is equal to that for input.
This is very conservative since few operations "generate" more new data than
they "consume".)
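
The scaling rule can be tried numerically. The sketch below assumes three blocks of K words are moved for every K**r dominant operations, as in the matrix example; the function names and the one-microsecond multiply time are illustrative only.

```python
def input_bandwidth(K, t_m, r=1.5, blocks_per_operation=3):
    """Words per second: blocks_per_operation * K words per K**r dominant operations."""
    return blocks_per_operation * K / (K ** r * t_m)


def bandwidth_factor(P, r):
    """Factor by which required bandwidth changes when block size grows by P."""
    return 1.0 / P ** (r - 1)


# Quadrupling the block size of an N**(3/2) process (assumed t_m of 1 microsecond).
ratio = input_bandwidth(4 * 256, 1e-6) / input_bandwidth(256, 1e-6)
```

For r = 3/2 the ratio comes out at one half, and for r = 1 the factor is 1, so the bandwidth does not depend on the block size.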

Increasing the block size for an $O(N^r)$ process (r > 1) has several
effects  on  the  various  memories  in  the  hierarchy.  The  most  obvious  of  these 
are 

i)   As  previously  shown,  the  required  bandwidth  from  the  buffer 
memory  (as  well  as  the  disk)  is  reduced,  and 

ii) The size of buffer memory decreases. This is because its
required size comes directly from the bandwidth and the disk's rotation time:
Buffer memory = $2Q = 2 B_w T_d$.

At first, then, it may seem that things continue to improve as the
block size gets bigger, or, for r = 1, at least remain constant. Why not just
make it as big as possible and thereby reduce the required buffer memory size
and the buffer memory and disk's bandwidth? This turns out not to be a successful
approach for all values of r for two reasons:

i) The buffer memory must be at least large enough to contain a
single block. Thus, the quantity 2Q is insufficient when $B_L > 2Q$.

ii) The primary memory will have to hold a few blocks. Typically
its requirement will be about three to five blocks.

One  further  observation  should  be  made  at  this  point.  When  the 
block size (or the combined size of two blocks in the case of a binary operation
such as matrix multiplication) exceeds Q, there is no longer any reason
to  make  use  of  the  algorithms  developed  in  this  paper.  An  input  buffer  equal 
to  the  block  size  is  sufficient  to  eliminate  all  latency.   This  is  accomplished 
by  the  straightforward  procedure  of  picking  up  each  block  (or  pair  of  blocks) 
in  the  last  rotation  before  they  are  needed.   Output  blocks  are  written  directly 
on  the  disk.   This  will  work  for  random  accesses  over  the  entire  data  base 
as  well  as  for  different  speeds  of  the  computational  processes  which  operate 
on the blocks.

The curve in Fig. 5-4 depicts the required buffer memory at that
point.

We  define  an  s-ary  process  to  be  one  in  which  s  data  blocks  are 
required  to  be  in  primary  memory  for  the  process  to  compute.  As  an  example, 
matrix  multiplication  of  whole  matrices  or  single  partitions  is  a  binary 
process.  Matrix  multiplication  of  partitioned  matrices  (which  includes  the 
addition of partial sums as well as the multiplication process) may be considered
a ternary process. (Actually it is only macroscopically a ternary
process  while  locally  it  is  a  (binary)  multiplication  process  followed  by  a 
(binary)  addition  process  but  it  is  the  macroscopic  level  with  which  we  are 
concerned.) 

Since the memory and bandwidth requirements go up as the number of
blocks which must be accommodated at a single time, we define the virtual
block size, $B_V$, of a process to be the number of blocks, s, which must be co-resident
in primary memory times the size of each block, $B_L$. Thus, $B_V = sB_L$.

In the equations which follow, buffer memory is indicated by the
symbol $B_m$.

From Fig. 5-4 it is clear that if there are no additional considerations,
the optimum block size to minimize the size of buffer memory is simply
that which yields one block retrieval from the disk on each rotation.

An $O(N^r)$ process is one in which the dominant operation is performed
$aB_L^r$ times for each virtual block (typically, $1 \le a \le 10$); we denote the time
to perform this operation by $t_p$.

The necessary input disk bandwidth is the number of data words per
computation on a block divided by the number of dominant operations in a
block computation times the time for that operation:

$$B_w = \frac{sB_L}{aB_L^r t_p} = \frac{s}{aB_L^{r-1} t_p}$$

Buffer memory = 2Q for the left portion of the curve in Fig. 5-4:

$$B_m = 2Q = 2B_w T_d = \frac{2sT_d}{aB_L^{r-1} t_p} \quad \text{for } B_L < Q$$

and

$$B_m = B_L \quad \text{for } B_L > Q$$

and we obtain the point of discontinuity in terms of the block size:

$$B_L = Q = B_w T_d = \frac{sT_d}{aB_L^{r-1} t_p}$$

or,

$$B_L^r = \frac{sT_d}{a t_p}.$$

That is, for $B_L^r < \frac{sT_d}{a t_p}$, $B_m = 2Q$,
and for $B_L^r > \frac{sT_d}{a t_p}$, $B_m = B_L$.

If primary memory contains, say, p blocks, then its size is $pB_L$ ($s \le p \le 10$)
and one can, then, compute the block size which actually minimizes the combined
cost of buffer and primary memory.
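
The two-branch buffer requirement derived above can be expressed as a short function. This is a sketch, with block sizes in words and times in seconds; the function names are invented, and the numeric values are the ones the text lists for Fig. 5-5 (s = 1, a = 3, r = 3/2, t_p = 30 nsec, T_d = 40 msec).

```python
def buffer_memory(B_L, s, a, r, t_p, T_d):
    """Buffer words needed for block size B_L under the two-branch rule."""
    Q = s * T_d / (a * B_L ** (r - 1) * t_p)  # words needed per rotation, B_w * T_d
    return 2 * Q if B_L < Q else B_L


def discontinuity(s, a, r, t_p, T_d):
    """Block size at which the rule switches from 2Q to B_L."""
    return (s * T_d / (a * t_p)) ** (1 / r)


fig = dict(s=1, a=3, r=1.5, t_p=30e-9, T_d=40e-3)
small_block = buffer_memory(429, **fig)    # left branch: 2Q
large_block = buffer_memory(10000, **fig)  # right branch: one block
switch_point = discontinuity(**fig)
```

With these values the switch occurs at a block size of a little under 6K words; below it the buffer requirement falls as the block grows, and above it the requirement is simply one block.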

If the buffer memory cost is d/(unit of memory) and the primary is
c/(unit of memory), we want the minimum of the function

$$\text{cost} = d\left[\frac{2sT_d}{aB_L^{r-1} t_p}\right] + c\,[pB_L]$$


This minimum will always occur before the discontinuity in buffer memory size.
We prove this by solving for the minimum. Take the derivative of the cost
function with respect to block size and set it to zero:

$$\frac{\partial(\text{cost})}{\partial B_L} = -(r-1)\,d\left[\frac{2sT_d}{aB_L^{r} t_p}\right] + cp = 0$$

which, with suitable algebraic manipulation, yields

$$B_L^r = \frac{2sT_d\,d(r-1)}{a t_p\, c p}$$

or

$$B_L = \left[\frac{2sT_d\,d(r-1)}{a t_p\, c p}\right]^{1/r}$$


The discontinuity from our earlier calculations is

$$B_L = \left[\frac{sT_d}{a t_p}\right]^{1/r}$$

The minimum of the cost function occurs at

$$B_L = \left[\frac{2d(r-1)}{cp}\right]^{1/r}\left[\frac{sT_d}{a t_p}\right]^{1/r}$$

and we want to consider the possible values for

$$\left[\frac{2d(r-1)}{cp}\right]^{1/r}.$$

We note first that $\frac{2d(r-1)}{cp}$
is positive (since r > 1) and that each of the terms is also.

The quantity $d/c$ will always be < 1; in fact it will almost always be
in the neighborhood of 0.1.

For most practical algorithms, r will not be larger than 2 or 3; let
us take 4 as an upper bound.

As mentioned earlier, $1 \le s \le p \le 10$, and if we take the very conservative
value of 1 for p, the above has an upper bound of

$$\left[\frac{(2)(0.1)(3)}{1}\right]^{1/r} = [0.6]^{1/r}$$


and  since  this  is  <  1  for  our  conservative  estimates  the  minimum  of  the  cost 
function  will  always  occur  for  a  smaller  block  size  than  that  at  which  the 
discontinuity  occurs. 
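
The bound can be evaluated directly. The sketch below computes the ratio of the cost-minimizing block size to the block size at the discontinuity, first for the conservative limits used in the argument (d/c = 0.1, r = 4, p = 1) and then for a case with r = 3/2 and p = 5. The function name is an assumption.

```python
def minimum_to_discontinuity_ratio(d_over_c, r, p):
    """Ratio of the cost-minimizing block size to the discontinuity block size."""
    return (2 * d_over_c * (r - 1) / p) ** (1 / r)


conservative = minimum_to_discontinuity_ratio(d_over_c=0.1, r=4, p=1)
typical = minimum_to_discontinuity_ratio(d_over_c=0.1, r=1.5, p=5)
```

Both ratios are below 1, and for the second set of values the minimum lies well under a tenth of the way to the discontinuity.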

Fig. 5-5 demonstrates some of these relationships for the following
values:

s = 1
p = 5
t_p = 30 nsec
T_d = 40 msec
r = 3/2
a = 3
d = $.12/word ≈ $.02/bit for 64 bit word
c = $1.20/word ≈ $.20/bit for 64 bit word

That point at which the total memory cost is a minimum (from Fig.
5-5) is approximately:
Buffer memory = 42K words
Primary memory = 2K words
Total memory cost = $7.2K
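
As a cross-check on the values read from the graph, the closed-form minimum can be evaluated for the same parameters. This is a sketch only: the function names are invented, costs are taken per word, and the results are not expected to match a graphical reading exactly.

```python
def optimum_block_size(s, p, a, r, t_p, T_d, d, c):
    """Block size (words) at which d*(buffer words) + c*(primary words) is least."""
    return (2 * s * T_d * d * (r - 1) / (a * t_p * c * p)) ** (1 / r)


def memory_cost(B_L, s, p, a, r, t_p, T_d, d, c):
    """Return (buffer words, primary words, total cost) for block size B_L."""
    buffer_words = 2 * s * T_d / (a * B_L ** (r - 1) * t_p)
    primary_words = p * B_L
    return buffer_words, primary_words, d * buffer_words + c * primary_words


params = dict(s=1, p=5, a=3, r=1.5, t_p=30e-9, T_d=40e-3, d=0.12, c=1.20)
B_opt = optimum_block_size(**params)
buffer_words, primary_words, total_cost = memory_cost(B_opt, **params)
```

The formula gives a block size of roughly 430 words, which corresponds to about 42K words of buffer memory and about 2K words of primary memory, in line with the figures quoted from the graph.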

It must be borne in mind, however, that there is an additional constraint
which must be considered that does not appear on the graph of Fig.
5-5. At some point the block size is too small for the bandwidth of the disk
to even allow a solution of the kind presented here. This threshold can be
represented by a horizontal line corresponding to 2Q for the particular
disk being considered. Block sizes small enough to require a buffer memory
larger than this are not feasible for our solutions, because even if we could
merely increase the buffer memory size, the bandwidth of the disk would be
inadequate.

5.8.2   Processes  in  Which  Changing  the  Block  Size  Has  No  Effect 

Processes which are $O(N)$ will require $aB_L$ operations for each s
blocks of size $B_L$. The bandwidth and buffer memory size are thus not dependent
upon the block size as long as the block size does not get larger than is
necessary  to  be  processed  for  an  entire  disk  rotation.   Thus  such  a  process 
will  have  its  buffer  memory  size  and  cost  plotted  as  a  horizontal  line  on  a 
graph such as that of Fig. 5-5.

Minimizing the memory cost in this case is accomplished by determining
the smallest possible block size consistent with the various hardware
and software constraints imposed by the system.

6. CONCLUSIONS AND DISCUSSION

The  algorithms  and  other  methods  presented  in  this  thesis  may  be 
directly  applied  to  a  significantly  large  class  of  problems.   The  important 
requirements  for  any  program  are  that  it  iterate  over  its  data  base  or 
portions  thereof  and  that  the  i/o  sequences  required  during  these  iterations 
and  their  timing  be  predictable.   Obvious  examples  of  such  programs  are 
numerical  partial  differential  equation  programs  with  large  data  bases  (e.g. 
weather  codes)  and  programs  which  contain  matrix  operations  for  very  large 
matrices  (which,  for  standard  matrix  operations,  are  inherently  iterative  and 
always  predictable). 

We  may  treat  these  latency  removing  techniques  as  either  system 
design  or  programming  tools.   In  the  former  case  we  can  obtain  various  bounds 
which will indicate the necessary hardware parameters for the memory hierarchy
that will fit the program and processor parameters. In the latter case,
we can determine whether a latency free execution is possible for the particular
system and program.

6.1  Applications  of  Latency  Elimination  Techniques  to  Existing  Systems 

The  application  of  our  techniques  to  existing  systems  is  chiefly 
to determine if they will "fit". In the case of an $O(N)$ process, this is
first  done  by  determining  if  the  memory  bandwidths  are  adequate  for  the 
algorithm  and  processor.   If  they  are,  the  sizes  of  buffer  and  primary 
memory  are  next  considered.   In  this  case  buffer  memory  must  be  equal  to 
2Q but the size of primary memory is dependent upon (among other parameters)
the block size. Thus, data block size may be adjusted to obtain a
fit in primary memory.

For $O(N^r)$ processes, r > 1, altering the block size has the additional
effects of altering the required disk and buffer memory bandwidths and
the  buffer  memory  size.   For  programs  containing  such  processes  one  can  compute 
or  plot  the  values  of  these  parameters  versus  block  size  to  obtain  those  values 
for  the  block  size  which  are  physically  realizable  for  the  existing  system. 
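
Such a search over block sizes can be sketched as a simple filter. The hardware limits below (disk bandwidth, buffer size, primary memory size) are hypothetical figures chosen for the illustration, the process parameters are those listed for Fig. 5-5, and only the regime in which buffer memory equals 2Q is considered.

```python
def feasible_block_sizes(candidates, s, p, a, r, t_p, T_d,
                         disk_bandwidth, buffer_words, primary_words):
    """Return the candidate block sizes (words) that fit the existing hardware."""
    fits = []
    for B_L in candidates:
        B_w = s / (a * B_L ** (r - 1) * t_p)  # required words per second
        if (B_w <= disk_bandwidth
                and 2 * B_w * T_d <= buffer_words
                and p * B_L <= primary_words):
            fits.append(B_L)
    return fits


usable = feasible_block_sizes(
    [64, 256, 1024, 4096],
    s=1, p=5, a=3, r=1.5, t_p=30e-9, T_d=40e-3,
    disk_bandwidth=1.0e6,   # words/sec, assumed
    buffer_words=65536,     # assumed
    primary_words=8192,     # assumed
)
```

For these assumed limits the smallest candidate demands more bandwidth than the disk supplies and the largest does not fit in primary memory, leaving the two middle sizes.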

6.2  System  Design  Applications  of  Latency  Elimination  Techniques 

When designing a system which is to employ the latency elimination
techniques presented in this paper, the relationships between the
various system parameters are known and all of the trade-offs may be considered.
Thus, for example, the required disk and buffer memory bandwidths
may be related to the processor, process and (for an $O(N^r)$ process, r > 1)
block size. The class of all processes which may be run on the system could
be provided for by providing a memory bandwidth as large as the largest required.
That is, a system may be designed for a particular type of program
(as  many  special  purpose  computer  systems  frequently  are)  or  for  many  types 
of  programs.  We  need  only  estimate  such  parameters  as  the  upper  bounds  of  the 
speed  and  size  over  the  entire  program  mix. 

For systems "tuned" to a particular class of programs, the relationships
between the system parameters can be used to obtain an optimal memory
hierarchy configuration in the manner in which memory cost was minimized
in Section 5.8.

6.3 Areas for Future Research

Some  further  research  which  is  suggested  by  that  in  this  paper  is  a 
generalization of our techniques to moveable arm disks. Some obvious constraints
(e.g. requiring a minimum wait of the longest possible head positioning
time between i/O operations) will allow the application of our techniques
directly, but more sophisticated refinement might yield a more practical model
of such devices.

Preplanning  techniques  such  as  ours  could  perhaps  be  generalized  to 
handle  situations  with  imperfect  information  on  i/O  operations.   This  would 
permit the same type of latency elimination schemes with just statistical information
on the i/O sequences. This might be done by storing several particular
data block sequences at particular angular positions on the disk when
it has been predicted that one of them (but not which one) will be needed at
that point. A method which was effective for this type of imperfect information
would require a less sophisticated analyzer than that of the methods
in this paper to extract the i/O information before run time, and would
probably also be more useful for a multi-programming environment.

The  methods  employed  to  effect  the  permutations  of  data  blocks  can 
be  applied  to  memory  configurations  to  obtain  hardware  methods  of  reordering, 
retrieving  and/or  storing  data  in  real  time.   Thus  data  could  be  made  to 
migrate  on  a  disk  through  basically  hardware  means. 